What is Duplicate Content?

Jan 26, 2009

If a search engine returned a bunch of results for your search query that were all pretty much the same, you would get a bit irritated, wouldn’t you? Search engines understand this, and have tweaked their algorithms to filter very similar pages so you get a range of results that are relevant, but still different enough.

According to Google, duplicate content generally refers to blocks of content on one website or across multiple domains that are either identical or almost identical. For example, a product manufacturer may issue a product description which is used by the majority of its resellers across their websites (across domains). Or, the reseller may have that product description appear on multiple pages (with different URLs) on its own site, perhaps under several categories like Power Tools, Specials and Top Rated.

Also, affiliates may link to a manufacturer’s site with an affiliate tracking ID appended to the URL, like www.manufacturersite.com/product-category/product-ID?affiliate=1234, which creates a copy of the page in the search engine’s index when the search engine follows the link from the affiliate’s page (because that’s how search engines find pages: by following links!)

Duplicate content is usually unintentional, like the manufacturer example above. In other cases, content is being “scraped” (stolen) and republished across spam sites, or being duplicated by shady search engine optimizers across multiple domains to manipulate search rankings. Translated pages and the occasional block of text (like a shipping policy) repeated across a site are not considered duplicate content.

Google and other search engines want to deliver a variety of useful results, and apply filters to the duplicate content in their indexes as best they can. This is a problem for two reasons:

a. When you have the same content as other domains, Google may not even display your page. The more sites that carry the same content, the lower your chances of appearing.
b. When you have duplicate content on your own site, you can’t control which page Google picks for results – it may be your print-friendly page, for example.

There is another way duplicate content can hurt your search engine rankings: it can dilute your “link juice.” Inbound links are perhaps the most important factor Google weighs when determining rankings; they act as “votes” from other sites that tell a search engine your page is worth linking to (and worth reading). When a search engine sees two pages with the same or near-identical content, it may filter out the one with fewer links pointing to it.

It’s not uncommon for a site to have multiple copies of its home page accessible to visitors and indexed in search engines, for example:

http://yoursite.com
http://www.yoursite.com
http://www.yoursite.com/en/home
http://www.yoursite.com/

To a search engine, these are four different URLs, each indexed separately. If some sites link to your www.yoursite.com version, some to your non-www version and so on, links to essentially the same page are split across 4 URLs. So what could have been 8 links pointing to one page becomes 2 links pointing to each of 4 pages. And since Google will only display one of them in its results, it picks one and attributes only 2 links to it.

301 Redirects to the Rescue

You should choose one URL and set up 301 (permanent) redirects to tell search engines to pass the link credit from all the redirecting versions to the page you choose (essentially adding the links together).

Which version should you redirect to? It’s up to you; it really doesn’t matter, as long as you pick one. You should set this up at the site level so that links to deeper pages on your site also redirect from www to non-www, or vice versa. On an Apache server, you can set this up in your .htaccess file, as in the sketch below.
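
For example, here is a minimal sketch of a site-wide redirect from the non-www to the www version (it assumes an Apache server with mod_rewrite enabled, and yoursite.com is a placeholder; swap the hostnames to redirect the other way):

  # .htaccess: 301-redirect every non-www request to the www version
  RewriteEngine On
  RewriteCond %{HTTP_HOST} ^yoursite\.com$ [NC]
  RewriteRule ^(.*)$ http://www.yoursite.com/$1 [R=301,L]

A request for http://yoursite.com/any-page then returns an HTTP 301 pointing to http://www.yoursite.com/any-page, so visitors and search engines both end up on, and credit links to, a single version.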

Here are some other tips from Google:

  • Block appropriately: Rather than letting our algorithms determine the “best” version of a document, you may wish to help guide us to your preferred version. For instance, if you don’t want us to index the printer versions of your site’s articles, disallow those directories or make use of regular expressions in your robots.txt file (see the sketch after this list).
  • Use TLDs: To help us serve the most appropriate version of a document, use top level domains whenever possible to handle country-specific content. We’re more likely to know that .de indicates Germany-focused content, for instance, than /de or de.example.com.
  • Syndicate carefully: If you syndicate your content on other sites, make sure they include a link back to the original article on each syndicated article. Even with that, note that we’ll always show the (unblocked) version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer.
  • Use the preferred domain feature of webmaster tools: If other sites link to yours using both the www and non-www version of your URLs, you can let us know which way you prefer your site to be indexed.
  • Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.
  • Avoid publishing stubs: Users don’t like seeing “empty” pages, so avoid placeholders where possible. This means not publishing (or at least blocking) pages with zero reviews, no real estate listings, etc., so users (and bots) aren’t subjected to a zillion instances of “Below you’ll find a superb list of all the great rental opportunities in [insert cityname]…” with no actual listings.
  • Understand your CMS: Make sure you’re familiar with how content is displayed on your Web site, particularly if it includes a blog, a forum, or related system that often shows the same content in multiple formats.
  • Don’t worry be happy: Don’t fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it’s highly unlikely that such sites can negatively impact your site’s presence in Google. If you do spot a case that’s particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and have us deal with the rogue site.
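
To illustrate the first tip above, here is a minimal robots.txt sketch that keeps crawlers out of printer-friendly copies. The /print/ directory and the affiliate parameter are hypothetical names for illustration, and note that in practice Google supports the * and $ wildcards rather than full regular expressions:

  # robots.txt: block printer-friendly duplicates (hypothetical /print/ directory)
  User-agent: *
  Disallow: /print/
  # Googlebot also honors wildcards, e.g. to block affiliate-tracking URLs:
  Disallow: /*?affiliate=

Place the file at the root of your domain (e.g. http://www.yoursite.com/robots.txt) so crawlers can find it.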

Linda Bustos is an eCommerce Analyst for Elastic Path Software, provider of an enterprise ecommerce framework. Linda blogs daily about Internet marketing for online retail at the Get Elastic eCommerce Blog.
