Duplicate content filtering is designed to prevent the irritating happenstance of the same content being presented to users in all or most of the top 10 positions in search results.
A common mistake when dealing with duplicate content filtering is assuming that there is only one main duplicate filter. There are in fact several duplicate filters, applied in order across the three stages of the search engine process:
- Spidering (crawling)
- Indexing
- Querying / SERPs
The first duplicate content filters eliminate content before Web pages are indexed, so that duplicate content is never displayed in search results. A Web page cannot appear in the SERPs until it is in a search engine index, and the crawl-time filters keep duplicate URLs from being added to it.
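As a rough illustration of what a crawl-time filter does, here is a minimal sketch that drops exact duplicates by hashing page text. Everything in it is an assumption made for the example: the `fingerprint` and `filter_duplicates` helpers, the whitespace and case normalization, and the example.com URLs. Real search engines do not publish their filters, and they almost certainly use something more forgiving than an exact hash.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash page text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_duplicates(pages):
    """Yield (url, text) pairs whose content has not been seen before.

    `pages` is an iterable of (url, text) pairs in crawl order, so the
    first URL found for a given piece of content is the one that is kept.
    """
    seen = set()
    for url, text in pages:
        digest = fingerprint(text)
        if digest in seen:
            continue  # duplicate content: this URL never reaches the index
        seen.add(digest)
        yield url, text


# Hypothetical crawl results: two URLs serve the same home page.
crawled = [
    ("http://example.com/", "Welcome to   our store"),
    ("http://example.com/index.html", "welcome to our store"),
    ("http://example.com/about", "About us"),
]
indexed = [url for url, _ in filter_duplicates(crawled)]
```

Note that the filter, not the site owner, decides which URL survives here: whichever one the crawler happened to reach first.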
The second set of duplicate content filters applies after pages are added to the search engine index. These pages are available to rank, but they do not always display in the SERPs. Sometimes they show up in the Supplemental Index instead. This does not mean a site has been penalized; it simply means the page has such a low PageRank that it appears in the Supplemental Index.
Matt Cutts says:
Having urls in the supplemental results doesn’t mean that you have some sort of penalty at all; the main determinant of whether a url is in our main web index or in the supplemental index is PageRank. If you used to have pages in our main web index and now they’re in the supplemental results, a good hypothesis is that we might not be counting links to your pages with the same weight as we have in the past. The approach I’d recommend in that case is to use solid white-hat SEO to get high-quality links (e.g. editorially given by other sites on the basis of merit).
The third filter is one you can apply yourself.
301 redirects can help with duplicate content issues that arise when the same page is reachable at several URLs, such as different versions of the home page.
You can use 301 redirects to point to the most appropriate URL for inclusion. By making a choice, you prevent the search engines from making a choice for you.
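To make the idea concrete, here is a minimal sketch of the decision behind such a redirect: given a requested URL, work out the one preferred URL and answer with a 301 if the two differ. The preferred host `www.example.com`, the list of index file names, and the helper names `canonical_url` and `redirect_for` are all assumptions for illustration; in practice this logic usually lives in the web server configuration rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # hypothetical preferred host
INDEX_FILES = ("index.html", "index.htm", "index.php", "default.asp")


def canonical_url(url: str) -> str:
    """Map home page and directory variants onto one preferred URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Collapse /index.html and similar files to the bare directory URL.
    head, _, tail = path.rpartition("/")
    if tail.lower() in INDEX_FILES:
        path = head + "/"
    # Keep the scheme and query string, force the preferred host,
    # and drop any fragment.
    return urlunsplit((parts.scheme, CANONICAL_HOST, path, parts.query, ""))


def redirect_for(url: str):
    """Return (301, target) if url is a duplicate variant, else None."""
    target = canonical_url(url)
    return (301, target) if target != url else None
```

With these assumptions, `redirect_for("http://example.com/index.html")` sends the visitor to `http://www.example.com/`, while a request for `http://www.example.com/` itself is served without a redirect.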
You can also use the robots exclusion protocol in some duplicate content cases; this works well for site redesigns when pages are abandoned but not removed because of their link juice value. Using a 301 redirect to the home page may frustrate the user; a custom 404 page with several options available for the user is more effective.
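Before relying on robots exclusion rules, it can help to check which URLs they actually cover. The sketch below uses Python's standard `urllib.robotparser` module to test URLs against a robots.txt body. The rules, the `/old-catalog/` and `/print/` paths, and the `is_excluded` helper are hypothetical examples, not recommendations for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a redesigned site: the abandoned catalog
# and the printer-friendly copies are kept online but closed to crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-catalog/
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_excluded(url: str, agent: str = "*") -> bool:
    """True if a crawler that obeys these rules would skip the URL."""
    return not parser.can_fetch(agent, url)


excluded = is_excluded("http://www.example.com/old-catalog/item1.html")
allowed = not is_excluded("http://www.example.com/products/item1.html")
```

Running a list of the abandoned URLs through a check like this is a quick way to confirm that a rule covers the duplicates and nothing else.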
You can simply exclude the page from indexing, and all will be well. These two methods are the most effective when dealing with duplicate content on your site. Tomorrow we can look at duplicate content across the web.