TL;DR

The AI industry’s easy supply of public web text is nearing a projected limit, according to research cited by Thorsten Meyer AI’s Control Series and a paper by Epoch AI researchers. Recent copyright settlements, publisher licenses, enterprise controls and sovereign data policies show high-value training data moving behind legal and commercial barriers. The timing of full public-data exhaustion remains uncertain, but unique, verified data is becoming harder to replace.

AI companies face a tighter data market in 2026. Public-web text is nearing projected training limits, and high-value corpora are moving behind contracts, lawsuits, private enterprise systems and national controls. The shift could make proprietary data a larger barrier than rented computing power.

Epoch AI researchers estimated in a paper on data limits that models could be trained on datasets roughly equal to the available stock of public human text between 2026 and 2032, with overtraining able to bring that date forward. Thorsten Meyer AI’s Control Series cites a roughly 300 trillion-token estimate for high-quality public text and a median exhaustion point around 2028.

The legal boundary also shifted. In the Anthropic authors case, U.S. District Judge William Alsup drew a distinction between training on lawfully acquired books and claims over pirated copies. The company later agreed to a $1.5 billion settlement covering about 465,000 books, according to The Associated Press; the deal does not cover future works.

Corporate data controls are tightening as well. Meta’s $14.3 billion investment for a 49% stake in Scale AI was followed by reported pullbacks from customers including Google, OpenAI and xAI, according to Business Insider. The episode underscored a wider concern: companies may not want sensitive training, labeling or evaluation data handled by a vendor aligned with a direct AI rival.

AI Dispatch · The Control Series · Part 3
Chokepoint 03 — Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.

Scarcity and value rise from the bottom tier to the top:

Sovereign / real-world (can’t be bought): Avengers combat data, FSD, ISR
Expert-authored (the new gold): PhDs, lawyers, surgeons define “good”
Licensed content (fenced): paywalled, deal-only, now priced
Public web text (commoditizing): scraped for free, exhausting ~2028

Key figures:

~300T: public text tokens, used up 2026–2032
$1.5B: Anthropic authors settlement, scraping era ends
$14.3B: Meta for 49% of Scale, triggered an exodus
“Keep the model”: Ukraine’s condition, data as sovereign asset

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own — so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine — keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.

Ownership Becomes The AI Moat

The shift matters because data, unlike cloud compute, is not interchangeable. A startup can rent processors and lease power, but it cannot duplicate a hospital’s patient records, a publisher’s archive, a company’s customer interactions, an autonomous-driving fleet’s road events or a military’s battlefield observations without access, consent and legal rights.

For companies using AI, this changes vendor risk. Proprietary logs, workflows, documents and customer behavior can become training material that improves a supplier’s model and, in some markets, a future competitor’s product. For creators and publishers, it increases bargaining power but may also favor companies able to pay large licensing bills.

From Free Crawls To Contracts

The early generative AI boom relied on broad web crawls, books, code, forums and other large collections that were often treated as low-cost inputs. That approach is now colliding with copyright litigation, licensing talks and technical limits on public human-made text.

Synthetic data is one response. Nvidia acquired synthetic-data firm Gretel for a reported price above its prior $320 million valuation, according to Wired, and major labs use AI-generated examples in training. But research on model collapse warns that systems trained too heavily on machine-made output can lose rare information or compound errors, which raises the value of fresh, verified human data.

“You cannot rent data that no one else has.”

— Thorsten Meyer AI, The Control Series, Part 3

Unknowns Around Scarce Training Data

It is not yet clear how close frontier labs are to the usable public-text ceiling, because training datasets are not fully disclosed and token-quality estimates vary. How courts will rule in other AI copyright cases, including claims involving news publishers and model outputs, is also unresolved.

The role of synthetic data is still developing. It may reduce data pressure in some domains, but the long-term mix of synthetic, licensed, expert-authored and proprietary data remains unsettled. The extent of sovereign control over military and battlefield datasets, including examples cited around Ukraine, is also not fully public.

Licenses And Lawsuits Set Pace

The next milestones are legal and commercial: settlement administration in the Anthropic case, discovery and rulings in other publisher lawsuits, new licensing deals for archives and expert content, and tighter enterprise contracts around data use. AI providers are likely to keep seeking exclusive or semi-exclusive corpora, while customers press for stronger limits on how their data can be used.

For readers and businesses, the practical issue is contract control. Firms with unique data will need to decide whether to license it, keep it internal, or trade access for model rights, audit rights and clear limits on reuse.

Key Questions

What is the news in this story?

AI’s data supply is becoming a business, legal and national-control chokepoint in 2026 rather than a freely available input scraped from the open web.

Has AI already used all public training data?

No. Researchers have projected a range, with frontier training runs potentially using up the stock of public human text between 2026 and 2032. The exact timing is uncertain because datasets and training methods are not fully public.

Why does proprietary data matter more now?

As models and rented compute become easier to access, unique datasets can separate one AI system from another. Examples include enterprise workflows, paid archives, expert feedback, driving data, medical records and military observations.

Did the Anthropic settlement settle fair use for AI training?

No. The settlement resolved claims over how certain books were obtained. The broader legal questions around future training, outputs and other copyrighted datasets remain active in other cases.

What should companies do with their own data?

Companies should know what data they give AI vendors, what rights they grant, whether the data can improve general models, and whether their contracts restrict reuse, sharing and competitor access.

Source: Thorsten Meyer AI
