A new proposal aims to address the crawling and indexing needs of large language models (LLMs), and it’s catching the attention of the digital world. Created by Australian technologist Jeremy Howard, the proposed llms.txt functions similarly to existing protocols like robots.txt and XML sitemaps. It is designed to improve LLMs’ efficiency by reducing the resource load when crawling and indexing entire websites.
One of its standout features is “full content flattening,” which could benefit brands and content creators by making their content more easily accessible to AI systems. This proposal comes at a time when the interaction between AI and online content is rapidly evolving, sparking interest as well as debate within the digital community.
While some creators see potential advantages, others remain sceptical about how it might impact content ownership and control. Nevertheless, given the growing influence of AI, llms.txt is shaping up to be a key topic in discussions about the future of web content and AI-driven indexing.
The new proposed standard for AI accessibility to website content
Bluesky CEO Jay Graber brought attention to content creator rights and data control during a session at SXSW Interactive in Austin, Texas, on 10 March. The discussion highlighted concerns over how user-generated content is used to train AI systems and the growing need for creators to have more control over their data.
A detailed proposal addressing this issue sparked interest, but a simpler protocol called llms.txt, introduced last September, may offer a more practical solution. While it isn’t as expansive as the earlier proposal, llms.txt aims to give web content creators greater control by specifying what AI models can access and how much content can be used.
These proposals aren’t mutually exclusive, though llms.txt appears to be further along in terms of implementation. Developed by technologist Jeremy Howard, it functions as a crawl and indexing standard for websites, using simple markdown language.
As AI models continue to consume vast amounts of online content, content owners are increasingly looking for ways to regulate how their data is utilised. llms.txt could offer a middle ground, providing context for how creators prefer their content to be handled while reducing the strain on AI models.
Unlike major search engines like Google or Bing, which possess advanced crawling capabilities, LLMs benefit from focusing on their core “intelligence” functions rather than extensive crawling. In theory, llms.txt could improve efficiency by allowing AI models to use technical resources more effectively.
This article delves into several key aspects of the proposal, including:
- What llms.txt is and how it works.
- How content creators might use it.
- Whether LLMs and website owners are adopting it.
- Why it could be significant for the future of AI-driven content use.
What llms.txt is and what it does
To better explain what this new standard aims to achieve, it’s useful to quote directly from Howard’s proposal:
“Large language models increasingly rely on website information but face a critical limitation: context windows are too small to handle most websites in their entirety. Converting complex HTML pages with navigation, ads, and JavaScript into LLM-friendly plain text is both difficult and imprecise.
“While websites serve both human readers and LLMs, the latter benefit from more concise, expert-level information gathered in a single, accessible location. This is particularly important for use cases like development environments, where LLMs need quick access to programming documentation and APIs.
“We propose adding a /llms.txt markdown file to websites to provide LLM-friendly content… llms.txt markdown is human and LLM-readable, but is also in a precise format allowing fixed processing methods (i.e. classical programming techniques such as parsers and regex).”
The proposed protocol presents some intriguing potential, especially in terms of generative engine optimisation (GEO) benefits. Since December, I have been testing its effectiveness and exploring its applications.
At its core, llms.txt allows website owners to manage how their content can be accessed and used by AI-powered models. Similar to robots.txt, which directs how search engine crawlers interact with websites, llms.txt is designed to establish guidelines for AI models that gather and process content for training or generating responses.
Unlike robots.txt, however, llms.txt does not rely on blocking. Its function is not to prevent access with directives like “Disallow.” Instead, it focuses on providing options that allow content creators to select what should be shared, either fully or contextually, with AI platforms.
The process is relatively straightforward. You can include URLs from specific sections of your website, add URLs with summaries, or even supply the full text of a page, either in a single file or divided into multiple ones.
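As a sketch of what such a file might look like (the site name, sections, and URLs below are hypothetical, not drawn from a real deployment), Howard's proposal describes a simple markdown structure: an H1 title, an optional blockquote summary, and H2 sections containing annotated link lists:

```markdown
# Example Widgets Ltd

> Example Widgets Ltd sells industrial widgets and publishes guides on widget selection and maintenance.

## Guides

- [Widget maintenance basics](https://example.com/guides/maintenance.md): An introduction to routine widget care.
- [Choosing a widget](https://example.com/guides/choosing.md): How to select the right widget for a project.

## Optional

- [Company history](https://example.com/about.md): Background on the business.
```

The "Optional" section is part of the proposed format: it marks content an LLM can skip when its context window is limited.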
For example, the llms.txt file on one of my websites contains the entire flattened text of the site. It is 115,378 words long, weighs in at 966 kilobytes, and is stored as a single .txt file in the domain root. However, your file could be smaller, larger, or split into separate files, depending on your needs. You could also organise the file across various directories within your website’s structure.
Another useful option is creating .md markdown files for individual web pages that you want to highlight for LLM attention. This feature is particularly helpful when performing in-depth site analysis. Importantly, llms.txt isn’t solely for LLM use – just as websites serve diverse purposes, this protocol has a range of potential applications. It offers flexibility for managing how content is presented to AI systems and ensures that website owners retain some degree of control in this evolving digital landscape.
Why llms.txt could matter for SEO and GEO
Controlling how AI models interact with your content is crucial, and providing a fully flattened version of your website can significantly simplify AI extraction, training, and analysis. There are several key reasons why this approach could be beneficial.
One important factor is the protection of proprietary content. By using llms.txt, content creators can theoretically prevent AI models from using original content without permission. However, this protection only applies to LLMs that choose to follow the directives outlined in the file.
Another advantage lies in brand reputation management. By guiding how information from your website appears in AI-generated responses, businesses may gain some control over their online image and how they are represented in AI-driven contexts.
From a technical standpoint, having a fully flattened version of your website also opens up new possibilities for linguistic and content analysis. With this easily consumable format, you can carry out various types of analysis that would normally require specialised tools. This includes keyword frequency tracking, taxonomy analysis, entity analysis, link audits, and competitive analysis, among others.
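As a minimal sketch of the kind of analysis this enables (the stop-word list and filename are illustrative assumptions, not part of the proposal), a flattened llms.txt can be mined for keyword frequency using nothing beyond the standard library:

```python
import re
from collections import Counter

def keyword_frequencies(text: str, top_n: int = 10) -> list[tuple[str, int]]:
    """Return the top_n most frequent words in a flattened text dump."""
    # Illustrative stop-word list; a real analysis would use a fuller one.
    stop_words = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(top_n)

# Example: analyse a local copy of a site's flattened llms.txt.
# with open("llms.txt", encoding="utf-8") as f:
#     print(keyword_frequencies(f.read()))

sample = "Widgets are durable. Durable widgets outlast cheap widgets."
print(keyword_frequencies(sample, top_n=3))
# -> [('widgets', 3), ('durable', 2), ('are', 1)]
```

The same pattern extends naturally to entity extraction or link audits: because the file is plain text in a predictable format, classical parsing techniques suffice, exactly as the proposal suggests.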
Enhanced AI interaction is another potential benefit. By using llms.txt, large language models can interact more efficiently with your website, retrieving accurate and relevant information. No formal standard is necessary for this – simply providing a clean, flattened file of your complete content is enough to improve AI processing.
This approach could also improve content visibility. By directing AI systems towards specific content on your site, llms.txt might help optimise your website for AI indexing. This, in turn, could enhance your visibility in AI-powered search results. Similar to SEO, there are no guarantees, but any preference an LLM may show towards llms.txt would be a step forward.
Better AI performance is another key outcome. By enabling LLMs to access the most valuable and relevant content on your site, llms.txt can contribute to more accurate AI-generated responses, whether users are engaging with chatbots or AI-enhanced search engines. Personally, I have found that the “full” rendering of llms.txt provides the most benefit, as summaries or simple URL lists do not seem any more useful than traditional tools like robots.txt or XML sitemaps.
Finally, using llms.txt could offer a competitive advantage. As AI technology continues to evolve, having a well-configured llms.txt file could position your website to be more AI-ready, potentially setting it apart from less prepared competitors.
Challenges and limitations
While llms.txt offers a promising approach, there are still several challenges that need to be addressed for it to be effective. One major concern is adoption by AI companies. Not every AI company may follow the proposed standard, and some might choose to ignore the file altogether, continuing to ingest website content regardless of the directives in llms.txt.
There is also the issue of adoption by websites themselves. For llms.txt to succeed, brands and website operators will need to actively participate. Achieving critical mass will be essential, even if not every site adopts the protocol. Without widespread involvement, its impact could be limited. In the absence of any other structured “optimisation” approach for AI interactions, the argument is: what have we got to lose? (That said, referring to this process as “optimisation” in relation to generative AI seems somewhat outdated and linguistically imprecise.)
Another challenge arises from potential overlaps and conflicts with existing protocols like robots.txt and XML sitemaps. This could lead to inconsistencies and confusion for site operators. It is important to emphasise that llms.txt is not intended to replace robots.txt. As mentioned earlier, the most practical use of llms.txt appears to be in providing a “full” rendering of website content in a simple, flattened text format.
The potential for misuse also cannot be ignored. Similar to keyword stuffing during the early days of SEO, there is nothing stopping website owners from overloading their llms.txt files with excessive text, keywords, links, and unnecessary content. This could lead to spam-like behaviour, reducing the intended usefulness of the protocol.
Additionally, exposing your website’s content in a simplified text file could make it easier for competitors to analyse your content for their own benefit. Competitive keyword research and content analysis are not new practices, but the simplicity of llms.txt could make it even easier for others to assess what your site offers – and what it might be lacking – to gain a strategic advantage.
There are also critical voices within the SEO and digital marketing community regarding the usefulness of llms.txt. Brett Tabke, CEO of Pubcon and WebmasterWorld, shared his scepticism about the protocol in a recent discussion. He argued that it doesn’t offer much added utility, pointing out that LLMs are not fundamentally different from traditional web crawlers. According to Tabke, the distinction between search engines and LLM-based systems has become increasingly blurred, especially as platforms like Google integrate LLMs into their core functionality, and he expects this trend to continue. In his view, llms.txt merely adds another layer of complexity without solving a real problem.
Tabke also highlighted that existing tools like XML sitemaps and robots.txt already serve a similar purpose. On this point, it’s hard to disagree. For some, the primary value of llms.txt lies in the potential of the “full” text rendering version, which provides a more detailed snapshot of a website’s content.
Marketer David Ogletree expressed similar concerns, cautioning against the misconception that LLMs are fundamentally different from traditional search engines. He argued that LLMs and platforms like Google are essentially performing the same function and should be treated similarly when it comes to managing content access and indexing.
Ultimately, while llms.txt presents an interesting concept, it remains a subject of debate. Whether it will gain widespread adoption or remain a niche tool depends largely on how AI companies, website owners, and the broader digital community choose to engage with it.
The future of llms.txt and AI content governance
As the adoption of AI technology continues to expand, the need for structured content governance becomes increasingly important. This is where llms.txt comes in, offering an early attempt to enhance transparency and provide website owners with greater control over how their content is accessed and utilised by AI systems.
Whether llms.txt will be widely adopted, however, remains uncertain. Its success will largely depend on several factors, including support from the industry, engagement from website owners, regulatory developments, and the willingness of AI companies to adhere to its directives.
It is crucial to stay informed about developments related to llms.txt and be prepared to adjust content strategies as AI-driven search and content discovery evolve. Keeping an eye on these changes could help ensure that your website remains optimised for both human visitors and AI systems.
The introduction of llms.txt represents a meaningful step towards balancing the rapid pace of AI innovation with the protection of content ownership rights. By addressing issues around the “crawlability and indexability” of websites, it aims to improve how large language models (LLMs) access and analyse online content.
Proactively exploring the implementation of llms.txt could be a valuable move for safeguarding your digital assets. At the same time, it provides AI systems with the necessary structure to better understand your website’s content and layout. This dual benefit may enhance both your control over AI interactions and the accuracy of AI-generated information derived from your site.
As AI continues to reshape the way online search and content distribution work, having a clear strategy for managing AI interactions with your website will be essential. Taking steps now to prepare for these ongoing changes could give you a competitive edge in an increasingly AI-driven digital landscape.