Recent research by BuzzStream shows that most leading news publishers are actively restricting how AI systems access their content. While the focus is often on blocking AI training, the data reveals that many publishers are also preventing AI tools from retrieving content for live answers and citations.

BuzzStream reviewed the robots.txt files of 100 major news websites across the UK and the US. The analysis found that 79% of these sites block at least one AI training bot. More significantly, 71% also block retrieval bots, which are responsible for pulling content in real time when users ask questions through AI-powered search tools.

This distinction matters. Training bots collect data to build large language models, while retrieval bots decide which sources appear in AI-generated answers. Blocking retrieval bots can mean a publisher’s content is excluded from AI citations altogether, even if the model was trained on it previously.
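As an illustration, robots.txt rules are written per user agent, so the two functions can be treated differently. The sketch below is a minimal hypothetical example using OpenAI's two tokens discussed later in this piece: it blocks the training crawler while leaving the live-search crawler free to fetch pages. Which tokens a publisher actually lists depends on the platforms it wants to address.

```
# Hypothetical robots.txt fragment: refuse training crawls but keep
# the live-search crawler (and therefore potential citations) allowed.
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
```

Omitting a token entirely generally has the same effect as allowing it: a crawler that finds no rules addressed to it (and no wildcard group) treats the site as open.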

How the Study Was Conducted

BuzzStream selected the top 50 news sites in both the UK and US based on Similarweb traffic data, then removed duplicates to create a final list of 100 publishers. Bots were grouped into three categories: training bots, retrieval or live search bots, and indexing bots.

Training Bots Face Widespread Blocks

Among training bots, Common Crawl’s CCBot was blocked by 75% of sites, making it the most restricted. The anthropic-ai agent followed closely at 72%, with Anthropic’s ClaudeBot blocked by 69% and OpenAI’s GPTBot by 62%.

Google-Extended, which is used to train Google’s Gemini models, was the least restricted training bot overall. Only 46% of sites blocked it. However, there was a notable regional difference. US publishers blocked Google-Extended 58% of the time, compared with just 29% among UK publishers.

Harry Clarkson-Bennett, SEO Director at The Telegraph, explained that many publishers see little benefit in allowing AI training access. He noted that AI systems typically do not drive meaningful referral traffic, while publishers remain heavily dependent on traffic to sustain their businesses.

Retrieval Bots Are Also Being Shut Out

The study found that blocking retrieval bots is nearly as common as blocking training bots. Claude-Web was blocked by 66% of sites, while OpenAI’s OAI-SearchBot, which supports ChatGPT’s live search features, was blocked by 49%. ChatGPT-User faced restrictions on 40% of sites.

Perplexity-User, which handles user-initiated retrieval requests, was the least blocked retrieval bot at just 17%. This suggests some publishers may still be experimenting with limited visibility in AI-powered discovery tools.

Indexing Bots See Heavy Restrictions Too

PerplexityBot, which indexes pages for Perplexity’s search system, was blocked by 67% of publishers.

Overall, only 14% of sites blocked every AI bot tracked in the study, while 18% allowed all of them.

The Limits of robots.txt

BuzzStream also highlighted a long-standing issue: robots.txt is not an enforcement mechanism. It is a guideline that well-behaved bots are expected to follow, but it does nothing on its own to prevent access.

Google’s Gary Illyes has previously confirmed that robots.txt functions more like a request than a barrier. Clarkson-Bennett echoed this concern, comparing it to a “please keep out” sign that can easily be ignored by bots designed to bypass restrictions.
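In practical terms, compliance is voluntary: a crawler has to fetch robots.txt and consult it before requesting pages, and nothing stops a bot from skipping that step. The short Python sketch below, using the standard library's urllib.robotparser with a placeholder site and user-agent token, shows what that voluntary check looks like from the crawler's side.

```python
# Minimal sketch of a *well-behaved* crawler's voluntary robots.txt check.
# The site URL and user-agent token are placeholders, not taken from the
# BuzzStream study. A non-compliant bot simply never runs this check.
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleRetrievalBot"             # hypothetical token
TARGET = "https://news.example.com/story-123"  # placeholder URL

parser = RobotFileParser()
parser.set_url("https://news.example.com/robots.txt")
parser.read()  # fetch and parse the publisher's robots.txt

if parser.can_fetch(USER_AGENT, TARGET):
    print("Allowed: the crawler may request the page.")
else:
    print("Disallowed: a polite crawler stops here, but nothing enforces it.")
```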

Cloudflare has documented cases where Perplexity allegedly used evasive crawling techniques, including rotating IP addresses and masking its user agent to resemble a standard browser. As a result, Cloudflare removed Perplexity from its verified bot list and now actively blocks it. Perplexity has disputed these claims and issued a public response.

For publishers that want stronger protection, BuzzStream suggests moving beyond robots.txt to CDN-level blocking or advanced bot fingerprinting.
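The difference with enforcement is that the request itself is refused rather than politely discouraged. The snippet below is a deliberately simplified, application-level Python illustration with an invented bot list; real deployments usually apply this logic at the CDN or reverse proxy and combine it with IP and fingerprint signals, because a user-agent string alone is easy to spoof, as the Perplexity episode above shows.

```python
# Simplified illustration of actually enforcing a block (unlike robots.txt):
# requests whose User-Agent matches a listed crawler receive a 403 before any
# content is served. Matching on User-Agent alone is for illustration only;
# production setups lean on CDN/WAF rules and bot fingerprinting.
import re
from wsgiref.simple_server import make_server

BLOCKED_BOTS = re.compile(r"GPTBot|CCBot|ClaudeBot|anthropic-ai", re.IGNORECASE)

def app(environ, start_response):
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if BLOCKED_BOTS.search(user_agent):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Automated access not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Article content would be served here.\n"]

if __name__ == "__main__":
    make_server("", 8080, app).serve_forever()
```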

Why Retrieval Blocking Matters

The most striking takeaway is how many publishers are opting out of AI citation systems entirely. Blocking retrieval bots does not just affect future model training. It directly impacts whether a site can appear as a cited source in AI-generated answers today.

AI platforms separate their crawlers by function. OpenAI uses GPTBot for training and OAI-SearchBot for live search. Perplexity distinguishes between bots used for indexing and those used for real-time retrieval. Blocking one does not automatically block the other, making these decisions more complex than they appear.
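To make that concrete with Perplexity's two tokens from the study, the hypothetical fragment below would keep pages out of Perplexity's index while leaving user-initiated retrieval unblocked; swapping the two rules would do the reverse. It is an illustration of how the functions separate, not a recommended configuration.

```
# Illustrative robots.txt fragment: disallow Perplexity's indexing crawler
# while leaving user-initiated retrieval requests unblocked.
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Allow: /
```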

The disparity in how Google-Extended is treated in the US and UK is also notable. Whether this reflects differing commercial relationships, risk assessments, or confidence in Gemini’s growth is unclear, but it is a trend worth monitoring.

What Comes Next

robots.txt remains a blunt tool for managing AI access, and publishers that are serious about blocking AI crawlers may need more robust technical controls. Cloudflare’s latest reports show that GPTBot, ClaudeBot, and CCBot receive the highest number of full disallow directives across major domains.

At the same time, many publishers continue to apply partial restrictions to Googlebot and Bingbot, reflecting the tension between protecting content from AI training and maintaining visibility in traditional search results.
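A partial restriction in this context usually means disallowing particular sections rather than the whole site. The fragment below is a hypothetical example of that pattern, with invented paths: the traditional search crawler keeps access to articles while selected areas stay off limits.

```
# Hypothetical partial restriction: Googlebot may crawl the site in general,
# but selected paths (invented here for illustration) are kept off limits.
User-agent: Googlebot
Disallow: /print/
Disallow: /internal-api/
```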

For publishers tracking AI exposure, retrieval bots are the key area to watch. Blocking training bots shapes future AI models, but blocking retrieval bots determines whether your content appears in AI answers right now.

 

More Digital Marketing Blogs here:

Local SEO 2024 – How To Get More Local Business Calls

3 Strategies To Grow Your Business

Is Google Effective for Lead Generation?

What is SEO and How It Works?

How To Get More Customers On Facebook Without Spending Money

How Do I Get Clients Fast On Facebook?

How Do I Retarget Customers?

How Do You Use Retargeting In Marketing?

How To Get Clients From Facebook Groups

What Is The Best Way To Generate Leads On Facebook?

How Do I Get Leads From A Facebook Group?
