If a search engine returned a bunch of results for your search query that were all pretty much the same, you would get a bit irritated, wouldn’t you? Search engines understand this, and have tweaked their algorithms to filter very similar pages so you get a range of results that are relevant, but still different enough.
According to Google, duplicate content generally refers to blocks of content on one website or across multiple domains that are either identical or almost identical. For example, a product manufacturer may issue a product description which is used by the majority of its resellers across their websites (across domains). Or, the reseller may have that product description appear on multiple pages (with different URLs) on its own site, perhaps under several categories like Power Tools, Specials and Top Rated.
Affiliates may also link to a manufacturer’s site with an affiliate tracking ID at the end, like www.manufacturersite.com/product-category/product-ID?=1234. This creates a copy of the page in the search engine index when the search engine follows the link on the affiliate page (search engines find pages by following links).
Duplicate content is usually unintentional, like the manufacturer example above. In other cases, content is being “scraped” (stolen) and republished across spam sites, or being duplicated by shady search engine optimizers across multiple domains to manipulate search rankings. Translated pages and the occasional block of text (like a shipping policy) repeated across a site are not considered duplicate content.
Google and other search engines want to deliver a variety of useful results, so they filter the duplicate content in their indexes as best they can. This causes two problems:
a. When you have the same content as other domains, Google may not even display your page. The more sites have the same content, the lower your chances of appearing.
b. When you have duplicate content on your own site, you can’t control which page Google picks for results – it may be your print-friendly page, for example.
There is another way duplicate content can hurt your search engine rankings: it can dilute your “link juice.” Incoming links are very important for search engine rankings, and are perhaps the most important factor when Google determines them. Inbound links are like “votes” from other sites that tell a search engine your page is worth linking to (and worth reading). When a search engine sees 2 pages with the same or near-identical content, it may filter out the one with fewer links pointing to it.
It’s not uncommon for a site to have multiple copies of its home page accessible to visitors and indexed in search engines, for example:
http://yoursite.com
http://www.yoursite.com
http://www.yoursite.com/en/home
http://www.yoursite.com/
These are all different URLs, and search engines count each one separately. If some sites link to your www.yoursite.com version, some to your non-www version and so on, links to essentially the same page are split across 4 URLs. So what could have been 8 links pointing to one page could be 2 links pointing to each of 4 URLs. Google will only display one of them in search results, so it picks one and attributes only 2 links to it.
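To make the arithmetic concrete, here is a minimal Python sketch of the 8-links example. The list of inbound links, the `HOME_PATHS` set and the `canonical` helper are all made up for illustration; real search engines use their own, far more involved logic.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical inbound links: each entry is the URL another site linked to.
inbound_links = [
    "http://yoursite.com",
    "http://yoursite.com",
    "http://www.yoursite.com",
    "http://www.yoursite.com",
    "http://www.yoursite.com/en/home",
    "http://www.yoursite.com/en/home",
    "http://www.yoursite.com/",
    "http://www.yoursite.com/",
]

# Assumption: on this imaginary site, all of these paths serve the home page.
HOME_PATHS = {"", "/", "/en/home"}


def canonical(url):
    """Map a home page variant to a single chosen URL (non-www, trailing slash)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = "/" if parts.path in HOME_PATHS else parts.path
    return f"{parts.scheme}://{host}{path}"


# Without consolidation, each URL variant is credited with only its own links.
split_counts = Counter(inbound_links)

# With every variant pointing at one URL, the links add up.
merged_counts = Counter(canonical(u) for u in inbound_links)

print(split_counts)
print(merged_counts)
```

Counted separately, each of the 4 variants gets 2 links; counted against one chosen URL, the same page gets all 8.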
301 Redirects to the Rescue
You should choose one URL and set up 301 permanent redirects from the other versions to it. A 301 tells search engines to pass the link credit from every redirecting page to the page you chose (essentially adding the links together).
Which version should you redirect to? It’s up to you. It really doesn’t matter, as long as you pick one. You should set this up at the site level so that links to deeper pages on your site also redirect from www to non-www, or vice versa. You can set this up in your .htaccess file.
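The site-level rule is easier to see spelled out. The Python sketch below is not an .htaccess rule, only an illustration of the mapping a www to non-www redirect performs; `CANONICAL_HOST` and `redirect_target` are hypothetical names, and choosing the non-www version is an arbitrary assumption.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the non-www version was picked as the one true host.
CANONICAL_HOST = "yoursite.com"


def redirect_target(url):
    """Return the URL a 301 should point to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    # Leave other sites alone, and leave already-canonical URLs alone.
    if bare != CANONICAL_HOST or host == CANONICAL_HOST:
        return None
    # Keep the path and query string so deeper pages redirect to their twins.
    return urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
    )


print(redirect_target("http://www.yoursite.com/power-tools/drill?ref=1"))
print(redirect_target("http://yoursite.com/power-tools/drill"))
```

Because only the host changes, a link to any deep page on the www version lands on the matching page of the non-www version, which is what lets the link credit follow it.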
Here are some other tips from Google
Linda Bustos is an eCommerce Analyst for Elastic Path Software, an enterprise ecommerce framework. Linda blogs daily about Internet Marketing for online retail at the Get Elastic eCommerce Blog.