The Data Drought

How AI Drank the Internet Dry

The Internet Was the Training Set

The story of modern AI is inseparable from the story of the internet. Large language models - GPT-4, Claude, Gemini, Llama - were trained on datasets scraped from across the web: books, academic papers, forums, news articles, source code, social media, product reviews, and more. The scale is difficult to comprehend. We are talking about trillions of words, representing decades of accumulated human knowledge, thought, and conversation.

This worked extraordinarily well. By exposing models to the breadth of human language and reasoning, researchers produced systems capable of writing, coding, analysing, translating, and problem-solving at levels that surprised even the people who built them. But that success has created a structural problem: the internet is not infinite, at least not relative to the appetite of next-generation AI systems.

When the Well Runs Dry

Researchers at Epoch AI published a sobering analysis estimating that the supply of high-quality, publicly available text data could be effectively exhausted as early as 2026. This is not about the internet ceasing to produce new content - billions of words are published every day. The constraint is on *quality and novelty*. Most of the high-value training data that exists has already been consumed by models that are already deployed.

The remaining untapped data tends to be lower quality: repetitive, poorly structured, or too niche to generalise from. Scraping more of it may actually degrade model performance rather than improve it. And training on AI-generated content - using one model's output to train the next - risks a degenerative feedback loop, sometimes called "model collapse", that amplifies errors and hallucinations rather than correcting them.

Some of the most valuable data - private messages, corporate documents, medical records, legal filings - is legally and ethically inaccessible. The open internet, which powered the first wave of AI breakthroughs, may simply not be enough to power the next.

The Industry Response

The response from AI labs has been pragmatic. OpenAI struck deals with the Associated Press, the Financial Times, News Corp, and several book publishers to access proprietary content. Google has similar arrangements. These partnerships give AI companies access to curated, high-quality text in exchange for licensing fees - a model that will likely become standard across the industry.

Others are betting on synthetic data: generating new training examples computationally rather than harvesting them from the wild. This works particularly well for structured tasks like mathematics, coding, and logical reasoning, where you can verify whether a generated example is correct. For open-ended language tasks, the results are more mixed.
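The "verify what you generate" idea can be made concrete with a toy sketch. This is an illustration, not any lab's actual pipeline: it generates arithmetic questions whose ground-truth answers are computed directly, so every training pair is correct by construction - exactly the property that makes structured domains friendly to synthetic data.

```python
import operator
import random

# Verifiable operations: for each generated question we can compute the
# ground-truth answer programmatically, so no human labelling is needed.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def generate_example(rng):
    """Produce one (question, answer) pair that is correct by construction."""
    a, b = rng.randint(1, 999), rng.randint(1, 999)
    sym = rng.choice(list(OPS))
    return f"What is {a} {sym} {b}?", str(OPS[sym](a, b))

def build_synthetic_dataset(n, seed=0):
    """Generate n verified training pairs; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [generate_example(rng) for _ in range(n)]

for question, answer in build_synthetic_dataset(3):
    print(question, "->", answer)
```

Open-ended language tasks lack this property: there is no function that checks whether a generated essay or conversation is "correct", which is why results there are more mixed.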

There is also a growing shift toward multimodal training - using video, audio, and images alongside text. A model that learns by watching humans perform tasks, explain concepts, or reason through problems is less constrained by the limits of text corpora alone.

What This Means for Your Business

For most small and medium businesses, the data drought is not an immediate operational problem. The AI tools available today - including those embedded in platforms like Acqui.app - are already built on models powerful enough to deliver real, measurable value. That capability is not disappearing.

But the drought does highlight something important: proprietary data is becoming a strategic asset. Businesses that collect, organise, and maintain rich data about their customers, operations, and markets are sitting on something increasingly valuable - not just for internal decision-making, but potentially as training input for custom AI models built specifically for their industry.

The first wave of AI was powered by public data. The next wave will be powered by private data. That puts businesses of every size in a stronger position than most of them realise - if they start treating their data seriously now.

The Takeaway

AI has consumed much of what the open internet had to offer. The next phase of development will be defined by proprietary data deals, synthetic generation pipelines, and specialised models trained on high-quality, narrow-domain corpora. For businesses, the implication is straightforward: start building and organising your own data now. The businesses that arrive at the next wave of AI already holding clean, structured, proprietary datasets will have a compounding advantage that latecomers will struggle to close.


"Is your business ready to harness the power of AI?"

Start with a free AI review of your operations - or try Acqui.app for $9/month.
