Human vs machine intelligence: How to win when ‘duplicate’ content is unique
As impressive as machine learning and algorithm-based intelligence can be, they often lack something that comes naturally to humans: common sense.
It’s common knowledge that putting the same content on multiple pages produces duplicate content. But what if you create pages about similar things, with differences that matter? Algorithms flag them as duplicates, though humans have no problem telling pages like these apart:
- E-commerce: similar products with multiple variants or critical differences
- Travel: hotel branches, destination packages with similar content
- Classifieds: exhaustive listings for identical items
- Business: pages for local branches offering the same services in different regions
How does this happen? How can you spot issues? And what can you do about it?
The danger of duplicate content
Duplicate content interferes with your ability to make your site visible to search users through:
- Loss of ranking for unique pages that unintentionally compete for the same keywords
- Inability to rank pages in a cluster because Google has chosen one page as the canonical
- Loss of site authority due to large quantities of thin content
How machines identify duplicate content
Google uses algorithms to determine whether two pages or parts of pages are duplicate content, which Google defines as content that is “appreciably similar”.
Google’s similarity detection is based on their patented Simhash algorithm, which analyzes blocks of content on a web page. It then calculates a unique identifier for each block, and composes a hash, or “fingerprint”, for each page.
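To make the idea concrete, here is a minimal, illustrative sketch of a Simhash-style fingerprint in Python. It is not Google's internal implementation: each token is hashed, and each bit of the final fingerprint records whether that bit position was set more often than not across the token hashes.

```python
# Illustrative Simhash-style fingerprint (a sketch, not Google's implementation).
import hashlib

def simhash(tokens, bits=64):
    # Running score per bit position: +1 when the token's hash sets that bit, -1 otherwise.
    scores = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            scores[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps a 1 wherever the positive votes outweigh the negative ones.
    fingerprint = 0
    for i, score in enumerate(scores):
        if score > 0:
            fingerprint |= 1 << i
    return fingerprint

print(hex(simhash("similar products with multiple variants".split())))
```

Because similar pages share most of their tokens, most bit positions receive the same votes, so their fingerprints differ in only a few bits.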
Because the number of web pages is colossal, scalability is key. Currently, Simhash is the only method that makes finding duplicate content feasible at this scale.
Simhash fingerprints are:
- Inexpensive to calculate. They are established in a single crawl of the page.
- Easy to compare, thanks to their fixed length.
- Able to find near-duplicates. They equate minor changes on a page with minor changes in the hash, unlike many other algorithms.
This last property means that the difference between any two fingerprints can be measured algorithmically and expressed as a percentage (see the sketch after this list). To reduce the cost of evaluating every single pair of pages, Google employs techniques such as:
- Clustering: by grouping sets of sufficiently similar pages together, only fingerprints within a cluster need to be compared, since everything else is already classified as different.
- Estimations: for exceptionally large clusters, an average similarity is applied after a certain number of fingerprint pairs are calculated.
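Because the fingerprints have a fixed length, the difference between two of them can be measured as a Hamming distance (the number of differing bits) and expressed as a percentage. A minimal sketch, using made-up 64-bit fingerprints:

```python
# Similarity between two fixed-length fingerprints: count the differing bits
# (Hamming distance) and normalize by the fingerprint length.
def similarity(fp_a: int, fp_b: int, bits: int = 64) -> float:
    # XOR leaves a 1 wherever the two fingerprints disagree.
    differing_bits = bin(fp_a ^ fp_b).count("1")
    return 100.0 * (bits - differing_bits) / bits

# Two fingerprints that differ in only a couple of bit positions look near-identical.
print(f"{similarity(0x8F3A21C4D5E6F701, 0x8F3A21C4D5E6F731):.1f}% similar")
```

This comparison is a cheap bitwise operation, which is what makes pairwise checks within a cluster affordable.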
Finally, Google uses a weighted similarity rate that excludes certain blocks of identical content (boilerplate such as headers, navigation, sidebars, footers, and disclaimers). It takes into account the subject of the page, using n-gram analysis to determine which words on the page occur most frequently and, in the context of the site, are most important.
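To illustrate the idea of a weighted rate, here is a hypothetical sketch in which boilerplate blocks carry zero weight and body blocks are weighted by their length. The exact weighting Google uses is not public; this only shows why identical headers and footers need not drag two distinct pages toward "duplicate".

```python
# Hypothetical weighted similarity rate: boilerplate blocks are ignored,
# body blocks are weighted by length, and the page-level rate is the
# weighted average of the block-level similarities.
def weighted_page_similarity(block_pairs):
    # block_pairs: list of (similarity_pct, block_length, is_boilerplate) tuples
    weighted_sum, total_weight = 0.0, 0.0
    for similarity_pct, block_length, is_boilerplate in block_pairs:
        weight = 0 if is_boilerplate else block_length
        weighted_sum += similarity_pct * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

blocks = [
    (100.0, 120, True),   # identical header: ignored
    (38.0, 900, False),   # main product description: drives the score
    (100.0, 80, True),    # identical footer: ignored
]
print(f"{weighted_page_similarity(blocks):.1f}% weighted similarity")
```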
Analyzing duplicate content with Simhash
We’ll be looking at a map of content clusters flagged as similar using Simhash. This chart from OnCrawl overlays an analysis of your duplicate content strategy on clusters of duplicate content.
OnCrawl’s content analysis also includes similarity ratios, content clusters, and n-gram analysis. OnCrawl is also working on an experimental heatmap indicating similarity per content block that can be overlaid on a webpage.
Validating clusters with canonicals
Using canonical URLs to indicate the main page in a group of similar pages is a way of intentionally clustering pages. Ideally, the clusters created by canonicals and those established by Simhash should be identical.
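As a quick check, you can compare the canonical groups declared on your pages against the content clusters found by a crawl. The sketch below assumes a hypothetical crawl export with url, canonical, and content_cluster fields, and flags canonical groups that span more than one content cluster.

```python
# Flag canonical groups whose pages fall into more than one content cluster.
# The field names and data here are hypothetical crawl-export values.
from collections import defaultdict

pages = [
    {"url": "/shirt-red",   "canonical": "/shirt", "content_cluster": 1},
    {"url": "/shirt-blue",  "canonical": "/shirt", "content_cluster": 1},
    {"url": "/shirt-linen", "canonical": "/shirt", "content_cluster": 2},  # mismatch
]

clusters_per_canonical = defaultdict(set)
for page in pages:
    clusters_per_canonical[page["canonical"]].add(page["content_cluster"])

for canonical, clusters in clusters_per_canonical.items():
    if len(clusters) > 1:
        # The canonical group spans several content clusters: worth reviewing.
        print(f"{canonical} groups pages from {len(clusters)} different content clusters")
```

When the two groupings disagree, either the canonical tags are mis-assigned or the page content does not support the grouping you intended.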
Solving duplicate content problems for unique content
There’s no satisfying trick to correct a machine’s view of unique pages that appear to be duplicates: we can’t change how Google identifies duplicate content. However, there are still solutions to align your perception of unique content and Google’s… while still ranking for the keywords you need.
The future of duplicate content
Google’s ability to understand the content of a page is constantly evolving. As Google becomes increasingly precise at identifying boilerplate and differentiating between the intents behind web pages, unique content flagged as duplicate should eventually become a thing of the past.
Until then, understanding why your content looks like duplicates to Google, and adapting it to convince Google otherwise, are the keys to successful SEO (Search Engine Optimization) for similar pages.