What is Duplicate Content?

January 26th, 2009 by Linda Bustos

If a search engine returned a bunch of results for your search query that were all pretty much the same, you would get a bit irritated, wouldn’t you? Search engines understand this, and have tweaked their algorithms to filter very similar pages so you get a range of results that are relevant, but still different enough.

According to Google, duplicate content generally refers to blocks of content on one website or across multiple domains that are either identical or almost identical. For example, a product manufacturer may issue a product description which is used by the majority of its resellers across their websites (across domains). Or, the reseller may have that product description appear on multiple pages (with different URLs) on its own site, perhaps under several categories like Power Tools, Specials and Top Rated.

Also, affiliates may link to a manufacturer’s site with an affiliate tracking ID at the end, like www.manufacturersite.com/product-category/product-ID?=1234. Search engines find pages by following links, so when a search engine follows the link on the affiliate page, it adds a copy of the manufacturer’s page to its index under that tracking URL.
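To make the affiliate case concrete, here is a minimal Python sketch that shows how several tracking URLs collapse to one page once the query string is removed. The second tracking ID (5678) and the helper name strip_query are made up for illustration, and dropping the whole query string assumes it carries nothing but tracking information, which will not be true for every site.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment.

    Assumption: the query string holds only an affiliate tracking ID.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# The first URL follows the article's example; the 5678 ID is hypothetical.
affiliate_links = [
    "http://www.manufacturersite.com/product-category/product-ID?=1234",
    "http://www.manufacturersite.com/product-category/product-ID?=5678",
    "http://www.manufacturersite.com/product-category/product-ID",
]

# Three different URLs, but only one distinct page behind them.
unique_pages = {strip_query(url) for url in affiliate_links}
print(len(affiliate_links), "URLs ->", len(unique_pages), "page")
```

A search engine that treats each of those URLs as its own page ends up with three copies of the same content in its index.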

Duplicate content is usually unintentional, like the manufacturer example above. In other cases, content is being “scraped” (stolen) and republished across spam sites, or being duplicated by shady search engine optimizers across multiple domains to manipulate search rankings. Translated pages and the occasional block of text (like a shipping policy) repeated across a site are not considered duplicate content.

Google and other search engines want to deliver a variety of useful results, and apply filters to the duplicate content in their indexes as best they can. This is a problem for two reasons:

a. When you have the same content as other domains, Google may not even display your page. The more sites that have the same content, the lower your chances of appearing.
b. When you have duplicate content on your own site, you can’t control which page Google picks for results – it may be your print-friendly page, for example.

There is another way duplicate content can hurt your search engine rankings: it can dilute your “link juice.” Incoming links are very important for search engine rankings, and are perhaps the most important factor when Google determines them. Inbound links are like “votes” from other sites that tell a search engine your page is worth linking to (and worth reading). When a search engine sees 2 pages with the same or near-identical content, it may filter out the one with fewer links pointing to it.

It’s not uncommon for a site to have multiple copies of its home page accessible to visitors and indexed in search engines, for example:

http://yoursite.com
http://www.yoursite.com
http://www.yoursite.com/en/home
http://www.yoursite.com/

These are all different URLs, and search engines count each one separately. If some sites link to your www.yoursite.com version, some to your non-www version and so on, links to essentially the same page are split across 4 URLs. So what could have been 8 links pointing to one page could instead be 2 links pointing to each of 4 URLs. Google will display only one of them in search results, so it picks one and attributes only 2 links to it.
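The arithmetic can be sketched in a few lines of Python. The link counts are the hypothetical ones from the example (8 inbound links, 2 per URL), and the choice of http://www.yoursite.com/ as the single preferred URL is an assumption made only for this illustration.

```python
from collections import Counter

# The four home page URLs from the list above.
HOME_VARIANTS = [
    "http://yoursite.com",
    "http://www.yoursite.com",
    "http://www.yoursite.com/en/home",
    "http://www.yoursite.com/",
]

# Assumption: this is the version the site owner wants to rank.
PREFERRED = "http://www.yoursite.com/"

# Hypothetical inbound links: 8 in total, 2 pointing at each variant.
inbound_links = [url for url in HOME_VARIANTS for _ in range(2)]

# Counted per URL, the links are split four ways.
split_counts = Counter(inbound_links)

# Counted per page, every variant is credited to the preferred URL.
combined_counts = Counter(
    PREFERRED if url in HOME_VARIANTS else url for url in inbound_links
)

print("best single URL when split:", max(split_counts.values()))
print("preferred URL when combined:", combined_counts[PREFERRED])
```

The gap between the two printed numbers is the link credit lost to dilution.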

301 Redirects to the Rescue

You should choose one URL and set up 301 permanent redirects from the other versions to it. A 301 redirect tells search engines to pass the link credit from every URL that redirects on to the page you choose (essentially adding the links together).

Which version should you redirect to? It’s up to you. It really doesn’t matter, as long as you pick one. You should set this up at the site level so that links to deeper pages on your site also redirect from www to non-www, or vice versa. On an Apache server, you can set this up in your .htaccess file.
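The decision a site-level redirect makes can be sketched as a small Python function. This is not an .htaccess rule, only an illustration of the logic such a rule would implement. The host names, the choice of www as the preferred version, and the function name redirect_for are all assumptions.

```python
from typing import Optional, Tuple

# Assumption: the www version is the one chosen as preferred.
PREFERRED_HOST = "www.yoursite.com"
ALTERNATE_HOSTS = {"yoursite.com"}


def redirect_for(host: str, path: str) -> Optional[Tuple[int, str]]:
    """Return (301, target URL) if the request needs a redirect, else None.

    The path is carried over unchanged, so deeper pages are redirected
    to the matching page on the preferred host.
    """
    if host.lower() in ALTERNATE_HOSTS:
        return 301, f"http://{PREFERRED_HOST}{path or '/'}"
    return None


# A deep link to the non-www host is sent to the same path on www.
print(redirect_for("yoursite.com", "/power-tools/drill"))
# A request that already uses the preferred host is left alone.
print(redirect_for("www.yoursite.com", "/power-tools/drill"))
```

Keeping the path intact is what makes this a site-level rule: one rule covers the home page and every deeper page, instead of one redirect per URL.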

Here are some other tips from Google:

  • Block appropriately: Rather than letting our algorithms determine the “best” version of a document, you may wish to help guide us to your preferred version. For instance, if you don’t want us to index the printer versions of your site’s articles, disallow those directories or make use of regular expressions in your robots.txt file.
  • Use TLDs: To help us serve the most appropriate version of a document, use top level domains whenever possible to handle country-specific content. We’re more likely to know that .de indicates Germany-focused content, for instance, than /de or de.example.com.
  • Syndicate carefully: If you syndicate your content on other sites, make sure they include a link back to the original article on each syndicated article. Even with that, note that we’ll always show the (unblocked) version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer.
  • Use the preferred domain feature of webmaster tools: If other sites link to yours using both the www and non-www version of your URLs, you can let us know which way you prefer your site to be indexed.
  • Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.
  • Avoid publishing stubs: Users don’t like seeing “empty” pages, so avoid placeholders where possible. This means not publishing (or at least blocking) pages with zero reviews, no real estate listings, etc., so users (and bots) aren’t subjected to a zillion instances of “Below you’ll find a superb list of all the great rental opportunities in [insert cityname]…” with no actual listings.
  • Understand your CMS: Make sure you’re familiar with how content is displayed on your Web site, particularly if it includes a blog, a forum, or related system that often shows the same content in multiple formats.
  • Don’t worry be happy: Don’t fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it’s highly unlikely that such sites can negatively impact your site’s presence in Google. If you do spot a case that’s particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and have us deal with the rogue site.
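As a rough illustration of the “Block appropriately” tip, the Python sketch below checks a robots.txt rule with the standard library’s urllib.robotparser. The /print/ directory and the article URLs are hypothetical. Python’s parser matches plain path prefixes only, so the example uses a simple directory rule rather than any pattern syntax.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers out of printer-friendly pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The printer version is blocked; the regular article is still crawlable.
print(parser.can_fetch("Googlebot", "http://www.yoursite.com/print/article-1"))
print(parser.can_fetch("Googlebot", "http://www.yoursite.com/articles/article-1"))
```

The first call prints False and the second prints True, so only the regular article remains open to crawlers.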

Linda Bustos is an eCommerce Analyst for Elastic Path Software, an enterprise ecommerce framework. Linda blogs daily about Internet Marketing for online retail at the Get Elastic eCommerce Blog.

7 Comments

February 2nd, 2009 by james s (not verified)

In regards to the duplicate content part of this blog post, I personally use the http://www.copygator.com website to find and stop duplicate content:

1. it's automated and brings me results instead of me searching for duplicated content. All i had to do was submit my feed and it started monitoring my feed showing me who's republished my articles on the web.

2. i get notified by email so it contacts me when it finds copies of my articles online.

3. i use their image badge feature to alert me directly on my website when my content is being lifted.

4. it's a free service as opposed to the "per page" cost of copyscape/copysentry.

February 5th, 2009 by Daryl James (not verified)

@James S…thanks for the copygator heads up.

@Linda, very good article. I never really paid much attention to the home page issue you mention. Thanks.

February 18th, 2009 by David Gardner (not verified)

How about dupecopdotcom? Is it a decent article checker? I’ve been using copygator for a while though. Just wondering aloud. Anyways, great post Linda. Cheers! ;-)

David

June 8th, 2009 by Stephanie (not verified)

great article…..keep up the good work

July 11th, 2009 by Biz (not verified)

I do occasionally publish my articles on isnare and ezine articles. Should I stop doing this?

July 13th, 2009 by cabal (not verified)

Lately it seems there has been an increase in datafeed driven/affiliate content sites out there. I myself have made quite a few. I have also seen the issue of what exactly is duplicate content discussed a few times recently.

We all know Google says that duplicate content is a “don’t” and as such you risk being banned or penalized for doing it. But what exactly is duplicate content? It isn’t just affiliate datafeed sites, such as those using Amazon AWS, that have duplicate content. People often create sites using feeds from Wikipedia and DMOZ, is this duplicate content? You could find a press release from Tivo on thousands of news, financial, or electronics websites. Is that duplicate content? What about game cheat sites that all list the same cheats?

I think we can all agree that when a single individual or business owns two websites with the exact same content that it is spam. But what about the thousands of websites owned by different people that all use the same content? Amazon AWS (Amazon Web Services) sites are not unique, they only offer affiliate content, and thus it’d seem Google would like to get rid of them in favor of listings for Amazon.com. In this situation it is easy to figure out who should get listed because there is a parent company everyone is affiliates with.

What about game cheat sites though? If you wanted to get rid of all the duplicate content how do you decide which one stays? DMOZ editors have faced this issue for a long time. You have two sites with the same content, which one is listed? My solution when I was an editor was to list them both, the reason is that maybe one site might be down when a user tries to visit it, so a certain amount of redundancy makes the directory more useful.

New datafeed enabled affiliate programs show up every day, as do new datafeed driven websites. Eventually there will be too many, search engines will have to do something, but what? There will be too many for manual review, and any automatic system could hurt other sites with duplicate content such as news sites and game cheat sites, etc. You might be able to write an algorithm that detects most Amazon AWS sites, but what about the thousands of other affiliate programs out there? And even then you’re still just getting most of the websites. People will find a way around any filters.

July 16th, 2009 by Linda Bustos (IXM)

@Biz

If you use the article reprint sites, make sure you put the article on your site or blog first, and encourage the reprint publishers to link back to your site. This way the search engines are far more likely to rank your site for the content than other sites (the links point back to you, and your site's copy is the oldest in their index).

@cabal

Duplicate content will not get you banned. The issue is that you may have ranking problems due to the duplicate content filter. It's harder to rank your domain when the same content exists across multiple domains, and you may have an issue with how deep search engines decide to crawl your site when you have excess pages or a lot of duplicate content. But you won't receive a penalty that removes your site, with all its pages, from the index completely.
