Human vs machine intelligence: How to win when ‘duplicate’ content is unique
As impressive as machine learning and algorithm-based intelligence can be, they often lack something that comes naturally to humans: common sense.
It’s common knowledge that putting the same content on multiple pages produces duplicate content. But what if you create pages about similar things, with differences that matter? Algorithms flag them as duplicates, though humans have no problem telling pages like these apart:
- E-commerce: similar products with multiple variants or critical differences
- Travel: hotel branches, destination packages with similar content
- Classifieds: exhaustive listings for identical items
- Business: pages for local branches offering the same services in different regions
How does this happen? How can you spot issues? And what can you do about it?
The danger of duplicate content
Duplicate content interferes with your ability to make your site visible to search users through:
- Loss of ranking for unique pages that unintentionally compete for the same keywords
- Inability to rank pages in a cluster because Google has chosen one page as the canonical
- Loss of site authority due to large quantities of thin content
How machines identify duplicate content
Google uses algorithms to determine whether two pages or parts of pages are duplicate content, which Google defines as content that is “appreciably similar”.
Google’s similarity detection is based on their patented Simhash algorithm, which analyzes blocks of content on a web page. It then calculates a unique identifier for each block, and composes a hash, or “fingerprint”, for each page.
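To make the idea concrete, here is a minimal, illustrative sketch of a Simhash-style fingerprint in Python. It is not Google's internal implementation: each token is hashed, and each bit of the final fingerprint records whether that bit position was set more often than not across the token hashes.

```python
# Illustrative Simhash-style fingerprint (a sketch, not Google's implementation).
import hashlib

def simhash(tokens, bits=64):
    # Running score per bit position: +1 when the token's hash sets that bit, -1 otherwise.
    scores = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            scores[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps a 1 wherever the positive votes outweigh the negative ones.
    fingerprint = 0
    for i, score in enumerate(scores):
        if score > 0:
            fingerprint |= 1 << i
    return fingerprint

print(hex(simhash("similar products with multiple variants".split())))
```

Because similar pages share most of their tokens, most bit positions receive the same votes, so their fingerprints differ in only a few bits.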
Because the number of web pages is colossal, scalability is key. Currently, Simhash is the only method that makes finding duplicate content feasible at this scale.
Simhash fingerprints are:
- Inexpensive to calculate. They are established in a single crawl of the page.
- Easy to compare, thanks to their fixed length.
- Able to find near-duplicates. They equate minor changes on a page with minor changes in the hash, unlike many other algorithms.
This last property means that the difference between any two fingerprints can be measured algorithmically and expressed as a percentage (see the sketch after this list). To reduce the cost of evaluating every single pair of pages, Google employs techniques such as:
- Clustering: by grouping sets of sufficiently similar pages together, only fingerprints within a cluster need to be compared, since everything else is already classified as different.
- Estimations: for exceptionally large clusters, an average similarity is applied after a certain number of fingerprint pairs are calculated.
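Because the fingerprints have a fixed length, the difference between two of them can be measured as a Hamming distance (the number of differing bits) and expressed as a percentage. A minimal sketch, using made-up 64-bit fingerprints:

```python
# Similarity between two fixed-length fingerprints: count the differing bits
# (Hamming distance) and normalize by the fingerprint length.
def similarity(fp_a: int, fp_b: int, bits: int = 64) -> float:
    # XOR leaves a 1 wherever the two fingerprints disagree.
    differing_bits = bin(fp_a ^ fp_b).count("1")
    return 100.0 * (bits - differing_bits) / bits

# Two fingerprints that differ in only a couple of bit positions look near-identical.
print(f"{similarity(0x8F3A21C4D5E6F701, 0x8F3A21C4D5E6F731):.1f}% similar")
```

This comparison is a cheap bitwise operation, which is what makes pairwise checks within a cluster affordable.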
Finally, Google uses a weighted similarity rate that excludes certain blocks of identical content (boilerplate such as headers, navigation, sidebars, footers, and disclaimers). It takes into account the subject of the page, using n-gram analysis to determine which words on the page occur most frequently and, in the context of the site, are most important.
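To illustrate the idea of a weighted rate, here is a hypothetical sketch in which boilerplate blocks carry zero weight and body blocks are weighted by their length. The exact weighting Google uses is not public; this only shows why identical headers and footers need not drag two distinct pages toward "duplicate".

```python
# Hypothetical weighted similarity rate: boilerplate blocks are ignored,
# body blocks are weighted by length, and the page-level rate is the
# weighted average of the block-level similarities.
def weighted_page_similarity(block_pairs):
    # block_pairs: list of (similarity_pct, block_length, is_boilerplate) tuples
    weighted_sum, total_weight = 0.0, 0.0
    for similarity_pct, block_length, is_boilerplate in block_pairs:
        weight = 0 if is_boilerplate else block_length
        weighted_sum += similarity_pct * weight
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

blocks = [
    (100.0, 120, True),   # identical header: ignored
    (38.0, 900, False),   # main product description: drives the score
    (100.0, 80, True),    # identical footer: ignored
]
print(f"{weighted_page_similarity(blocks):.1f}% weighted similarity")
```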
Analyzing duplicate content with Simhash
We’ll be looking at a map of content clusters flagged as similar using Simhash. This chart from OnCrawl overlays an analysis of your duplicate content strategy on clusters of duplicate content.
OnCrawl’s content analysis also includes similarity ratios, content clusters, and n-gram analysis. OnCrawl is also working on an experimental heatmap indicating similarity per content block that can be overlaid on a webpage.
Validating clusters with canonicals
Using canonical URLs to indicate the main page in a group of similar pages is a way of intentionally clustering pages. Ideally, the clusters created by canonicals and those established by Simhash should be identical.
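As a quick check, you can compare the canonical groups declared on your pages against the content clusters found by a crawl. The sketch below assumes a hypothetical crawl export with url, canonical, and content_cluster fields, and flags canonical groups that span more than one content cluster.

```python
# Flag canonical groups whose pages fall into more than one content cluster.
# The field names and data here are hypothetical crawl-export values.
from collections import defaultdict

pages = [
    {"url": "/shirt-red",   "canonical": "/shirt", "content_cluster": 1},
    {"url": "/shirt-blue",  "canonical": "/shirt", "content_cluster": 1},
    {"url": "/shirt-linen", "canonical": "/shirt", "content_cluster": 2},  # mismatch
]

clusters_per_canonical = defaultdict(set)
for page in pages:
    clusters_per_canonical[page["canonical"]].add(page["content_cluster"])

for canonical, clusters in clusters_per_canonical.items():
    if len(clusters) > 1:
        # The canonical group spans several content clusters: worth reviewing.
        print(f"{canonical} groups pages from {len(clusters)} different content clusters")
```

When the two groupings disagree, either the canonical tags are mis-assigned or the page content does not support the grouping you intended.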
Solving duplicate content problems for unique content
There’s no satisfying trick to correct a machine’s view of unique pages that appear to be duplicates: we can’t change how Google identifies duplicate content. However, there are still solutions to align your perception of unique content and Google’s… while still ranking for the keywords you need.
The future of duplicate content
Google’s ability to understand the content of a page is constantly evolving. As Google becomes increasingly precise at identifying boilerplate and differentiating between the intents behind web pages, unique content flagged as duplicate should eventually become a thing of the past.
Until then, understanding why your content looks like duplicates to Google, and adapting it to convince Google otherwise, are the keys to successful SEO (Search Engine Optimization) for similar pages.