Reddit CEO: LLMs Built on Reddit Data

Reddit CEO Steve Huffman has said that large language models “would not exist as we know them” without Reddit’s vast pool of user-generated content, describing the platform’s data as a form of “modern oil” for artificial intelligence development. He made the comments in an interview at Fast Company’s Most Innovative Companies Summit, where he spoke at length about Reddit’s growing importance in the AI ecosystem.

Reddit positioned as a core AI data source

Huffman emphasised that Reddit is one of the largest and most influential sources of training data used by AI systems today. He also claimed that Reddit is among the most frequently cited platforms across major language models, reinforcing its role in shaping how these systems respond to user prompts.

He explained that the conversational nature of Reddit makes it especially valuable for AI training, saying that models rely heavily on human discussions that span almost every topic imaginable. In his view, this type of natural dialogue is what allows AI systems to generate more human-like responses.

He stated:
“LLMs would not exist as we know them without Reddit… a large portion of that consumption is actually just the human conversation on Reddit because it’s natural and it covers basically every topic imaginable.”

Growing importance of data licensing deals

Reddit has already moved to formalise its position in the AI supply chain through licensing agreements with companies such as Google and OpenAI. Huffman referred to these as the company’s earliest major partnerships in the AI space and suggested they helped establish Reddit’s commercial value in training models.

He indicated that Reddit is now taking a more selective approach to new agreements, as the value of its data has become clearer across the industry. Rather than broadly opening access, the platform is focusing on controlled partnerships where usage terms are clearly defined.

At the same time, Reddit has become more active in protecting its data. The company has launched legal action against organisations including Anthropic and Perplexity, accusing them of scraping content without permission and violating platform terms.

Huffman drew a clear distinction between partners and non-partners, explaining that companies willing to work within licensing agreements are able to access Reddit data under structured conditions, while others face legal consequences.

He also reinforced Reddit’s stance that commercial use of its content requires commercial terms, noting that the platform introduced paid API access in 2023 as part of this shift.

Why Reddit changed its approach to data access

According to Huffman, Reddit’s more restrictive data policies are a response to changes in how the AI industry operates. He suggested that the shift away from open research has made it harder for platforms like Reddit to track how their data is being used once it is accessed externally.

He also said that in earlier years Reddit operated with a more open philosophy, reflecting the broader open internet culture. However, he believes that openness would only have been sustainable if AI development had remained open-source and transparent.

A key concern, he noted, is the lack of visibility over downstream use of Reddit content. Without knowing how data is being applied, Reddit has less control over potential misuse, including targeting users or replicating content in ways that bypass the platform.

This, he argued, is part of the reason Reddit now prioritises structured access agreements and legal protections over unrestricted availability.

Reddit’s own AI-powered tools

While it supplies data to external AI systems, Reddit is also building its own artificial intelligence features. One of the most visible is Reddit Answers, a tool that uses language models to summarise discussions and respond to user questions.

Huffman explained that the feature is designed to reflect the platform’s core value of human perspective. Rather than generating standalone answers, it relies heavily on direct quotes from Reddit users and presents multiple viewpoints where appropriate.

The goal, he said, is to support queries that do not have a single correct answer while still grounding responses in real community discussions.

Behind the scenes, Reddit is also using AI for moderation and content classification. These systems help identify harmful or inappropriate content more efficiently than traditional manual review processes.

Huffman described this as a practical improvement to platform safety, noting that AI can help reduce the burden of reviewing highly sensitive or unpleasant material.

The challenge of AI-generated posts

Huffman also addressed a growing issue on the platform: users posting content written with AI tools such as ChatGPT. He made a distinction between automated bots and human users who simply rely on AI to generate text.

While acknowledging that there is still a human behind the idea, he admitted that AI-generated posts often lack quality and are easily identified by the community.

He suggested that rather than introducing strict enforcement policies, Reddit will continue to rely on community moderation. Users already tend to downvote AI-written content and challenge it in comment sections, effectively filtering it out organically.

Huffman compared the situation to broader technological shifts, noting that society is still adjusting to how AI fits into everyday communication and writing.

Balancing openness, control, and growth

Overall, Huffman’s comments highlight the tension Reddit faces as both a major contributor to AI training data and a platform trying to maintain control over its content.

On one hand, Reddit’s discussions are clearly valuable to AI developers and have led to major licensing deals. On the other, the company is increasingly focused on protecting its intellectual property and ensuring proper commercial use.

Legal action against firms accused of unauthorised scraping reflects this shift, while ongoing partnerships with companies like Google and OpenAI show Reddit’s willingness to collaborate under the right conditions.

Looking ahead

Huffman confirmed that Reddit continues to explore additional data partnerships, although no new deals were announced during the interview.

With ongoing lawsuits and expanding AI integration across the platform, Reddit appears to be positioning itself as both a key infrastructure provider for AI training and a gatekeeper of its own content.

As the AI landscape continues to evolve, Reddit’s role is likely to become even more central, both as a data source and as a platform shaping how that data can be used.