Publishers vs Common Crawl Scraping

Digital Content Next (DCN), a US trade organisation representing digital publishers, has issued a cease and desist notice to the Common Crawl Foundation, demanding it stop collecting publisher content and remove material already stored in its datasets.

The move, reported by DCN chief executive Jason Kint and later covered by Press Gazette, marks a significant escalation in an ongoing debate over how publicly accessible web data is used in large-scale datasets that support AI development.

Common Crawl’s Role in AI Training Data

Common Crawl has been collecting and archiving billions of web pages each month since 2007, building a vast public dataset that is widely used in artificial intelligence training. For example, OpenAI’s GPT-3 research paper reported that filtered Common Crawl data accounted for about 60% of the model’s weighted training mix.

This makes the dispute particularly relevant for publishers concerned about how their content is used in AI systems. While blocking the Common Crawl bot (CCBot) can prevent future scraping, it does not remove content that has already been stored in its archive, which remains publicly accessible.

What Publishers Are Demanding

In its legal letter, DCN calls on Common Crawl to cease collecting, storing, or distributing copyrighted, paywalled, subscriber-only and otherwise protected material belonging to its member organisations. It also demands the removal of such content from existing datasets.

The group argues that Common Crawl has improperly included copyrighted material in its archives and made it available to third parties, including AI developers.

DCN also disputes the idea that publishers must actively opt out, stating that copyright law should not function on an “opt-out” basis. Instead, it suggests that explicit permission should be required before content is included.

Jason Kint described the issue as a challenge to the assumption that professionally produced content can be freely collected and repurposed simply because it is publicly accessible online.

Concerns Over Data Removal Practices

The letter also raises questions about whether Common Crawl fully respects opt-out requests or consistently removes content when asked. According to Press Gazette, DCN’s legal team is examining whether earlier assurances made by Common Crawl to publishers may have been misleading or incomplete.

Common Crawl maintains a public list of websites that have requested exclusion, including major organisations such as the Associated Press, the BBC, and parts of the News/Media Alliance. However, concerns remain over whether removal is fully effective in practice.

Previous reporting by The Atlantic also suggested that content from The New York Times and from Danish publishers remained accessible even after removal requests had been made.

Common Crawl’s Position

Common Crawl executive director Rich Skrenta has declined to comment directly on the latest legal notice. However, in earlier responses, he has rejected claims that the organisation deliberately includes paywalled content or misleads publishers.

He has explained that the archive’s structure does not allow for simple post-publication edits without compromising its integrity. Instead, removal requests are handled through filtering processes that prevent the data from appearing in future distributions or public tools.

According to Skrenta, the organisation acts promptly on removal requests but has always acknowledged that the process is technically complex and not instantaneous.

He has also said that Common Crawl is involved in developing open standards to allow websites to signal their preferences around AI scraping more clearly.

Why This Dispute Matters

Two questions sit at the heart of the disagreement: whether content already in the archive should remain accessible once a publisher opts out, and whether inclusion should require prior consent in the first place.

Research cited by industry groups suggests that a large proportion of publishers are already blocking AI-related crawlers, with many seeking to restrict the use of their content in training datasets. However, critics argue that blocking future access does not resolve the issue of previously collected material still being used.

Outlook

The outcome of the dispute will depend on how Common Crawl responds to DCN’s demands. At present, the two sides are advocating fundamentally different approaches to content governance.

Common Crawl continues to support opt-out mechanisms and broader industry standards for signalling scraping preferences, while DCN is pushing for a model based on explicit permission before content is collected at all.

If other publisher groups adopt similar positions, the debate may shift from individual crawler controls to the legality of large-scale web archives themselves.