The Internet Is Training on Its Own Pollution

A few weeks ago, a founder sent me an article about cash management for very small businesses. It looked competent at first glance. The structure was clean, the tone was polished, and the advice sounded familiar enough to pass.

Then the seams started showing.

Some numbers did not add up. A few “standard” methods seemed to exist nowhere outside that page. Certain phrases repeated with the smooth, generic rhythm that has become its own warning sign. The article had almost certainly been generated by an early language model, from the period when confident fabrication came bundled with fluency.

What made it unsettling was not the article itself. It was its age.

The piece was already three years old. It had been shared widely, indexed everywhere, and ranked on the first page of search. Which means it was no longer just bad content sitting in a forgotten corner of the web. It had become part of the informational environment. It was available to search engines, to readers, to research tools, and very likely to the systems now scraping the internet for training data.

That is the part people still underestimate. The damage from low-quality synthetic content is not confined to the moment it is published. It lingers. It gets cited. It gets paraphrased. It gets folded into summaries and recommendation engines. Then newer systems encounter it as if it were part of the natural history of knowledge.

The web is filling with material that was generated quickly, checked poorly, and distributed at industrial scale. Newer models are better than their predecessors in almost every obvious way, but they still learn from a world that earlier models already helped contaminate.

The first wave never disappeared

It is tempting to think of the rough early years of generative AI as a solved problem. The models from 2022 and 2023 hallucinated constantly. Today’s models are stronger at reasoning, better at grounding, and less likely to invent nonsense with the same cheerful abandon. That part is real.

The older output did not vanish, though. It stayed online.

Between the launch of mainstream generative tools and the current moment, millions of articles, videos, product reviews, ebooks, and social posts were produced by systems that were persuasive long before they were reliable. A decent share of that material was never verified by a human who understood the subject. Some of it was harmless filler. Some of it was wrong in subtle ways. Some of it was pure fabrication wearing business-casual clothes.

The common reply is that the web was always messy. Of course it was. Human writing has always included mistakes, ideology, cargo-cult expertise, and SEO sludge. But generative systems changed the production function. They made it trivial to create plausible mediocrity by the ton. The old internet had garbage. The new one can manufacture it on demand, twenty-four hours a day, in every niche.

That scale changes the problem. A single inaccurate blog post is noise. Ten thousand derivative versions of it start to look like consensus.

Recursive learning is not a science-fiction problem

Researchers sometimes describe a version of this as model collapse: systems trained repeatedly on synthetic output can lose diversity, flatten nuance, and amplify errors, especially in the tails of a distribution where rare facts live. That term can be overstated. Frontier models are not simply downloading random AI spam and swallowing it whole. Labs filter aggressively. They use curated datasets, human preference data, and increasingly, synthetic data generated under controlled conditions.

So the immediate risk is less cinematic than “the model eats itself and dies.”

It is more like epistemic drift.

If enough of the public web becomes derivative, then the easiest-to-reach version of reality gets fuzzier. Retrieval systems surface polluted documents. Search results point to pages built from summaries of summaries. A model asked about an obscure tax rule, a minor historical event, or a brown dwarf’s atmospheric chemistry may encounter a stack of sources that all resemble one another because they came from the same synthetic family tree.

The center of the distribution survives longer than the edges. Basic facts about the French Revolution or the Krebs cycle can still be checked against many durable sources. Long-tail knowledge is more fragile. That is where invented details can enter the record and then circulate back as if they had always been there.

You can already see the pattern. A generated article introduces a false but memorable claim. A video script reuses it. That video gets views. A newsletter paraphrases the video. Soon the claim has enough surface area to look established. By the time another model encounters it, the lie no longer appears as a lonely error. It appears as a small cluster.

Pollution becomes infrastructure

Once synthetic material is indexed, ranked, clipped, and summarized, it stops behaving like content and starts behaving like infrastructure.

Search is the obvious example. For years, the web’s bargain with creators was simple enough: publish something useful, and search might send you readers. That system was never pure, but it was legible. Now the path is stranger. AI summaries increasingly sit between the user and the source, compressing many pages into an answer-shaped object. Sometimes that object is decent. Sometimes it is a blender full of half-truths. Either way, it reduces the chance that a reader will inspect the underlying material.

When traffic no longer reliably flows back to sources, fewer people invest in making high-quality sources in the first place. The web starts consuming its seed corn.

Reviews show the same pattern. Bot-written praise and bot-written complaints are no longer a novelty. Fake users can fill a product page, a map listing, or an app store with synthetic testimony at a volume that used to require a click farm. The point is not always persuasion. Sometimes it is friction. If every signal can be cheaply counterfeited, trust becomes expensive.

Academic publishing, once imagined as a protected layer above the mess, is not immune either. Analyses of millions of papers have found sudden increases in words and turns of phrase strongly associated with language models. That does not prove every such paper is fraudulent, and assistance is not the same as misconduct. Plenty of researchers use AI to polish grammar or draft boilerplate. Still, the trend matters because disclosure has not kept pace with use.

More worrying are the cases where people exploit the fact that other people are using AI. Some researchers have embedded hidden prompts in papers, sometimes in white text or invisible formatting, apparently aimed at automated reviewers or summarizers. The mechanism is almost funny in a cursed sort of way. The implication is not. We are already building systems that read on our behalf, and writers are already learning to manipulate those readers.

This is how contamination spreads now. It does not arrive with a dramatic explosion. It settles into the joints of the system.

Cleanup is harder than people think

Every proposed fix sounds reasonable until it meets the internet as it actually exists.

Watermarking generated content would help at the point of creation, but only if major model providers adopted it, preserved it, and made it difficult to strip. None of those conditions are guaranteed. Open models complicate enforcement. Simple transformations can remove traces. And watermarking does nothing for the mountain of synthetic material already published.

AI detectors are worse. They fail in both directions that matter. They accuse humans of writing like machines, which is insulting and occasionally consequential, and they miss machine-generated text that has been lightly edited by a competent person. The more fluent models become, the shakier this whole category gets.

Human curation works, but only in bounded spaces. A serious editorial process can still produce trustworthy material. So can a disciplined research team, a careful forum, or a respected reference database. The trouble is scale. The open web is too large, too dynamic, and too economically misaligned for manual review to clean it comprehensively.

Some people suggest training models only on pre-2022 data, before the synthetic flood really hit. That might reduce contamination, but it also freezes the world. You lose recent science, current law, fresh reporting, new terminology, and live culture. A model that knows nothing after 2022 is clean in one sense and badly crippled in another.

There is another complication people gloss over. Synthetic data is not inherently bad. In some contexts it is useful, even powerful. Carefully generated reasoning traces, simulated conversations, translated examples, and domain-specific augmentations can improve models when they are produced under strict controls. The issue is not machine-made text as such. It is unlabeled, unverifiable synthetic text entering the commons and masquerading as observation.

That distinction matters because it tells you why this problem feels slippery. We are not dealing with a simple category of banned material. We are dealing with provenance, incentives, and incentives are stubborn things.

The economic damage comes first

The internet runs on attention, and attention is one of the few things that does not scale.

If low-cost synthetic media gets good enough to seize most casual attention, then the loss shows up before the underlying truth crisis becomes obvious. You see it in traffic patterns, creator burnout, and the slow devaluation of effort. A channel that spends weeks researching a video now competes with a pipeline that can publish several plausible imitations in the same period. An independent writer who checks sources now competes with a swarm of pages designed to catch a query and hold a visitor just long enough for an ad impression.

The immediate result is not universal deception. It is dilution.

The average user becomes surrounded by material that is serviceable, forgettable, and slightly untrustworthy. That environment trains behavior. People skim more. They verify less. They stop expecting distinct voices, because the median page sounds like every other page. Over time, genuine expertise gets harder to perceive through the fog, a bit like trying to identify a live instrument in a song buried under compression artifacts.

This also creates pressure on honest creators. Some will use AI responsibly as a tool for editing, coding, transcription, or layout. That is normal and often sensible. Others will adopt it defensively, because the market punishes slowness more quickly than it rewards care. A publication that refuses all automation may be noble and broke by Thursday. A publication that automates everything may survive longer while saying less.

That tension is why the current moment feels so corrosive. The technology is genuinely useful. The surrounding incentive structure is what turns utility into sludge.

A narrower web, with clearer lines

The old dream of an open web where quality naturally rises to the top was already fraying before generative AI arrived. What changes now is the degree of skepticism required to move through it intelligently. General trust is being replaced by selective trust. Broad discovery is giving way to narrower circuits: publications with real editors, creators with known standards, communities with active moderation, archives with provenance, experts whose work can be checked.

That sounds smaller because it is smaller.

The practical response is not to reject AI wholesale or pretend human writing is automatically virtuous. People lie, plagiarize, posture, and get things wrong without any machine assistance. The practical response is to value processes that leave traces. Show sources. Link primary material. Disclose how AI was used when it materially shaped the output. Build products and institutions where verification is part of the cost, not an optional extra sacrificed to speed.

A healthier internet from here probably looks less like a giant undifferentiated feed and more like a map of places that have earned belief. That will be slower. It may also be more expensive. But the cheaper path is already visible all around us: pages no one remembers, videos no one trusts, summaries all the way down.