TL;DR
The AI industry’s easy supply of public web text is nearing a projected limit, according to research cited by Thorsten Meyer AI’s Control Series and a paper by Epoch AI researchers. Recent copyright settlements, publisher licenses, enterprise controls and sovereign data policies show high-value training data moving behind legal and commercial barriers. The timing of full public-data exhaustion remains uncertain, but unique, verified data is becoming harder to replace.
AI companies are facing a tighter data market in 2026 as public-web text nears projected training limits and high-value corpora move behind contracts, lawsuits, private enterprise systems and national controls. The shift could make proprietary data a larger barrier than rented computing power.
Epoch AI researchers estimated in a paper on data limits that models could be trained on datasets roughly equal to the available stock of public human text between 2026 and 2032, with overtraining able to bring that date forward. Thorsten Meyer AI’s Control Series cites a roughly 300 trillion-token estimate for high-quality public text and a median exhaustion point around 2028.
The legal boundary also shifted. In the Anthropic authors case, U.S. District Judge William Alsup drew a distinction between training on lawfully acquired books and claims over pirated copies. The company later agreed to a $1.5 billion settlement covering about 465,000 books, according to The Associated Press; the deal does not cover future works.
Corporate data controls are tightening as well. Meta’s $14.3 billion investment for a 49% stake in Scale AI was followed by reported pullbacks from customers including Google, OpenAI and xAI, according to Business Insider. The episode underscored a wider concern: companies may not want sensitive training, labeling or evaluation data handled by a vendor aligned with a direct AI rival.
Data: The One Thing You Can’t Rent
The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.
Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own, so guard your proprietary data, and don’t hand it to a provider who can become your competitor, the lesson Scale AI’s customers learned when they pulled back after Meta’s investment. Nations: license it like Ukraine; keep the model, keep the leverage.
Ownership Becomes The AI Moat
The shift matters because data, unlike cloud compute, is not interchangeable. A startup can rent processors and lease power, but it cannot duplicate a hospital’s patient records, a publisher’s archive, a company’s customer interactions, an autonomous-driving fleet’s road events or a military’s battlefield observations without access, consent and legal rights.
For companies using AI, this changes vendor risk. Proprietary logs, workflows, documents and customer behavior can become training material that improves a supplier’s model and, in some markets, a future competitor’s product. For creators and publishers, it increases bargaining power but may also favor companies able to pay large licensing bills.

From Free Crawls To Contracts
The early generative AI boom relied on broad web crawls, books, code, forums and other large collections that were often treated as low-cost inputs. That approach is now colliding with copyright litigation, licensing talks and technical limits on public human-made text.
Synthetic data is one response. Nvidia acquired synthetic-data firm Gretel for a reported price above its prior $320 million valuation, according to Wired, and major labs use AI-generated examples in training. But research on model collapse warns that systems trained too heavily on machine-made output can lose rare information or compound errors, which raises the value of fresh, verified human data.
“You cannot rent data that no one else has.”
— Thorsten Meyer AI, The Control Series, Part 3

Unknowns Around Scarce Training Data
It is not yet clear how close frontier labs are to the usable public-text ceiling, because training datasets are not fully disclosed and token-quality estimates vary. It is also unresolved how courts will rule in other AI copyright cases, including claims involving news publishers and model outputs.
The role of synthetic data is still developing. It may reduce data pressure in some domains, but the long-term mix of synthetic, licensed, expert-authored and proprietary data remains unsettled. The extent of sovereign control over military and battlefield datasets, including examples cited around Ukraine, is also not fully public.
Licenses And Lawsuits Set Pace
The next milestones are legal and commercial: settlement administration in the Anthropic case, discovery and rulings in other publisher lawsuits, new licensing deals for archives and expert content, and tighter enterprise contracts around data use. AI providers are likely to keep seeking exclusive or semi-exclusive corpora, while customers press for stronger limits on how their data can be used.
For readers and businesses, the practical issue is contract control. Firms with unique data will need to decide whether to license it, keep it internal, or trade access for model rights, audit rights and clear limits on reuse.

Key Questions
What is the news in this story?
AI’s data supply is becoming a business, legal and national-control chokepoint in 2026, rather than a freely available input scraped from the open web.
Has AI already used all public training data?
No. Epoch AI researchers projected that frontier training runs could use up the available stock of public human text at some point between 2026 and 2032. The exact timing is uncertain because datasets and training methods are not fully public.
Why does proprietary data matter more now?
As models and rented compute become easier to access, unique datasets can separate one AI system from another. Examples include enterprise workflows, paid archives, expert feedback, driving data, medical records and military observations.
Did the Anthropic settlement settle fair use for AI training?
No. The settlement resolved claims over how certain books were obtained. The broader legal questions around future training, outputs and other copyrighted datasets remain active in other cases.
What should companies do with their own data?
Companies should know what data they give AI vendors, what rights they grant, whether the data can improve general models, and whether their contracts restrict reuse, sharing and competitor access.
Source: Thorsten Meyer AI