Compression can be a valuable tool for search engines to identify low-quality pages. While this concept may not be commonly discussed, it is important for SEO professionals to understand its implications.

Compressibility can help search engines detect duplicate content, doorway pages built from near-identical material, and pages filled with repetitive keywords, which makes it worth understanding when optimising web pages.

A research paper has shown that on-page features can be effectively used to spot spam content. However, due to the lack of transparency from search engines, it’s challenging to confirm whether these or similar techniques are actively employed in practice.

 

What Is Compressibility?

In computing, compressibility refers to the extent to which a file can be reduced in size without losing important information. This process is commonly used to optimise storage space and facilitate the transmission of more data over the Internet.

 

TL;DR Of Compression

Compression works by substituting repeated words and phrases with shorter references, which can significantly decrease file size. Search engines often compress indexed web pages to optimise storage, minimise bandwidth usage, and enhance retrieval speed, among other benefits.

 

This is a simplified explanation of how compression works:

Identify Patterns: A compression algorithm scans the text for repeated words, phrases, and patterns.

Replace Repeats With Codes: Each repeated word or phrase is swapped for a short code or symbol that points back to its first occurrence.

Shorter References Use Fewer Bits: The codes take up far less storage space than the original words and phrases, which reduces the overall file size.

An added benefit of compression is its ability to help identify duplicate pages, doorway pages with similar content, and pages containing repetitive keywords.
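To see why repetitive pages stand out, here is a minimal sketch using Python's built-in gzip module. It is purely illustrative and not taken from the research paper: the keyword-stuffed sample and the pseudo-random "ordinary" text are invented stand-ins for real pages.

import gzip, random, string

random.seed(0)

# Keyword-stuffed page: one phrase repeated over and over (highly redundant).
stuffed = ("cheap plumber london " * 500).encode("utf-8")

# Stand-in for an ordinary page: pseudo-random words with little redundancy.
words = ["".join(random.choices(string.ascii_lowercase, k=random.randint(3, 9))) for _ in range(1500)]
ordinary = " ".join(words).encode("utf-8")

for label, page in (("keyword-stuffed", stuffed), ("ordinary", ordinary)):
    compressed = gzip.compress(page)
    ratio = len(page) / len(compressed)  # uncompressed size / compressed size
    print(f"{label}: {len(page):,} bytes -> {len(compressed):,} bytes, ratio {ratio:.1f}")

Running this, the repeated phrase shrinks dramatically while the low-redundancy text barely compresses, which is exactly the gap the researchers exploited.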

 

Research Paper About Detecting Spam

This research paper matters partly because of its authors: notable computer scientists who have made significant contributions to AI, distributed computing, information retrieval, and related fields.

Marc Najork
One of the co-authors, Marc Najork, is a Distinguished Research Scientist at Google DeepMind. He has co-authored papers on TW-BERT, on improving the accuracy of implicit user feedback signals such as clicks, and on AI-based information retrieval methods such as DSI++: Updating Transformer Memory with New Documents, among many other contributions to the field.

Dennis Fetterly
Another co-author, Dennis Fetterly, is a software engineer at Google. He is a co-inventor on a patent for a ranking algorithm that uses links, and he has published research in distributed computing and information retrieval.

These two researchers are part of a team behind a 2006 Microsoft research paper that explores identifying spam through on-page content features. One of the features examined in the paper is compressibility, which the authors found can serve as a classifier to indicate potentially spammy web pages.

 

Detecting Spam Web Pages Through Content Analysis

Although this research paper was published in 2006, its insights are still relevant today. At that time, as now, many were trying to rank numerous location-based web pages that featured mostly duplicate content, differing only in the names of cities, regions, or states. Similarly, SEOs frequently created web pages aimed at search engines by excessively repeating keywords in titles, meta descriptions, headings, internal anchor text, and the content itself to enhance rankings.

 

In Section 4.6 of the paper, the authors note:

“Some search engines give higher weight to pages containing the query keywords several times. For example, for a given query term, a page that contains it ten times may be ranked higher than a page that contains it only once. To exploit such engines, some spam pages replicate their content multiple times in an attempt to rank higher.”

The research explains that search engines compress web pages and use the compressed version as a reference for the original page. They highlight that an excessive amount of repetitive words leads to a higher level of compressibility. This prompted them to investigate whether there is a connection between high compressibility and spam content.

 

They state:

“Our approach in this section to locating redundant content within a page is to compress the page. To save space and disk time, search engines often compress web pages after indexing them but before adding them to a page cache.

…We measure the redundancy of web pages by the compression ratio, which is the size of the uncompressed page divided by the size of the compressed page. We used GZIP to compress pages, a fast and effective compression algorithm.”
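As a rough illustration of that measure, the ratio can be reproduced with Python's built-in gzip module. This is a sketch based on the paper's definition, not the researchers' code, and the helper name is ours.

import gzip

def compression_ratio(html: str) -> float:
    # Compression ratio as defined in the paper:
    # uncompressed page size divided by GZIP-compressed page size.
    raw = html.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

print(round(compression_ratio("location page content " * 300), 1))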

 

High Compressibility Correlates To Spam

The research findings indicated that web pages with a compression ratio of 4.0 or higher were often low-quality or spam pages. However, the highest compressibility rates were less reliable due to a smaller number of data points, making them more challenging to interpret.

 

The researchers reached the following conclusion:

“70% of all sampled pages with a compression ratio of at least 4.0 were identified as spam.”
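Building on the compression_ratio sketch above, a naive filter along the lines of that finding might flag any page whose ratio reaches 4.0. The threshold value comes from the paper, but the check itself is only an illustration; as the article explains next, using the ratio on its own produces false positives.

SPAM_RATIO_THRESHOLD = 4.0  # the level at which 70% of sampled pages were spam

def looks_redundant(html: str) -> bool:
    # Reuses the compression_ratio() sketch defined earlier.
    return compression_ratio(html) >= SPAM_RATIO_THRESHOLD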

 

However, they also found that relying solely on the compression ratio led to false positives, where some non-spam pages were mistakenly classified as spam:

“The compression ratio method, as described in Section 4.6, performed best, accurately identifying 660 (27.9%) of the spam pages in our sample, while misclassifying 2,068 (12.0%) of all pages assessed.

When combining all the features mentioned, the classification accuracy from the ten-fold cross-validation process was promising:

95.4% of our assessed pages were correctly classified, while 4.6% were misclassified.

Specifically, for the spam category, 1,940 out of 2,364 pages were correctly identified. For the non-spam category, 14,440 out of 14,804 pages were accurately classified. In total, 788 pages were misclassified.”

The following section discusses a notable finding on how to enhance the accuracy of using on-page signals for spam identification.

 

Insight Into Quality Rankings

The research paper explored various on-page signals, including compressibility. It found that while each individual signal could identify some spam, relying on any single signal often led to non-spam pages being incorrectly flagged as spam, commonly known as false positives.

A key finding of the research, which is essential for anyone involved in SEO, is that using multiple classifiers improved the accuracy of spam detection and reduced the likelihood of false positives.

The main takeaway is that while compressibility is effective for identifying a particular type of spam, it does not capture all forms of spam, indicating that other methods are necessary for comprehensive detection.

 

It’s crucial for every SEO and publisher to understand the following point:

“In the previous section, we outlined several heuristics for assessing spam web pages. We measured various characteristics of these pages and identified ranges that correlated with spam. However, when used in isolation, no single method was able to identify the majority of spam without also flagging a significant number of non-spam pages as spam.

For instance, the compression ratio heuristic discussed in Section 4.6 was one of our more effective methods, showing that the average probability of a page being spam with a ratio of 4.2 or higher is 72%. However, only about 1.5% of all pages fall into this category, which is considerably lower than the 13.8% of spam pages identified in our dataset.”

Thus, while compressibility was one of the stronger indicators of spam, it still could not detect the entire range of spam present in the researchers’ dataset.

 

Combining Multiple Signals

The results indicated that individual signals of low quality are not very accurate. To address this, the researchers tested the effectiveness of using multiple signals. They found that combining several on-page signals for detecting spam led to improved accuracy and reduced the number of pages misclassified as spam.

 

The researchers explained their approach to using multiple signals: 

“One way to combine our heuristic methods is to frame the spam detection issue as a classification problem. In this context, we aim to develop a classification model (or classifier) that can utilise the features of a web page to accurately classify it as either spam or non-spam.”

 

Here are their conclusions regarding the use of multiple signals:

“We examined various aspects of content-based spam on the web using a real-world dataset from the MSNSearch crawler. We presented several heuristic methods for detecting content-based spam. While some of our spam detection methods are more effective than others, relying on any single method may not capture all spam pages. Therefore, we combined our detection methods to create a highly accurate C4.5 classifier. This classifier successfully identifies 86.2% of all spam pages, while misclassifying very few legitimate pages as spam.”
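To make the "combine the heuristics into a classifier" idea concrete, here is a hedged sketch. The paper trained a C4.5 decision tree; scikit-learn's DecisionTreeClassifier is used below as a rough stand-in, and the feature names, values, and labels are invented placeholders rather than the researchers' dataset.

from sklearn.tree import DecisionTreeClassifier

# Each row: [compression_ratio, fraction_of_popular_words, avg_word_length, title_keyword_count]
X_train = [
    [4.8, 0.90, 5.1, 12],   # redundant, keyword-stuffed page
    [1.9, 0.55, 4.7, 1],    # ordinary page
    [5.6, 0.95, 5.4, 15],
    [2.1, 0.60, 4.5, 2],
]
y_train = [1, 0, 1, 0]      # 1 = spam, 0 = non-spam (toy labels)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

page_features = [[4.2, 0.88, 5.0, 9]]
print("spam" if clf.predict(page_features)[0] else "non-spam")

The point of the sketch is the structure, not the numbers: once several on-page signals are expressed as features, a single trained model can weigh them together instead of relying on any one threshold.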

 

Takeaways

While it’s unclear if compressibility is directly used by search engines, it can serve as a useful signal, especially when combined with other factors, to identify basic types of spam, such as numerous doorway pages that feature similar content related to city names. Even if search engines do not rely on this signal, it highlights how straightforward it is to detect such manipulative practices, which search engines are now well-equipped to manage.

 

Here are the main points to remember from this article:

– Doorway pages with duplicate content are easy to identify because they tend to have a higher compression ratio than standard web pages.

– In the study, roughly 70% of web pages with a compression ratio of at least 4.0 were spam.

– Using negative quality signals alone to identify spam can lead to false positives.

– The study showed that on-page negative quality signals only detect certain types of spam.

– When used in isolation, the compressibility signal primarily identifies redundancy-related spam, missing other spam types and resulting in false positives.

– Combining multiple quality signals enhances the accuracy of spam detection and reduces false positives.

– Modern search engines, aided by AI systems like Google's SpamBrain, demonstrate improved spam detection capabilities.

 
