Synthetic data: model collapse and epistemic...

Executive Summary

Observation: synthetic data has become essential for training and testing systems when real-world data is scarce, expensive, or dangerous to collect.

Documented risk: when used recursively for training, it can cause model collapse, progressively erasing rare cases and impoverishing the learned distribution.

Emerging risk: when the same model of the world produces both training and evaluation data, research may measure conformity to a simulator rather than robustness to reality.

Thesis: synthetic data should extend the reach of reality, not replace it. External, independent validation based on real-world data remains the indispensable anchor.

First things first: what science is

Before discussing synthetic data, we need to revisit the scientific method. Not out of pedantry, but because it is the heart of the problem.

Science is not a collection of truths. It is a process.

A process that begins with an observation, formulates a hypothesis, designs an experiment, gathers data, and confronts the model with reality. The decisive step is not confirmation. It is the possibility of being contradicted.

Karl Popper put it clearly: a scientific theory is not one that has been proven, but one that can be falsified. What cannot be false tells us nothing about the world. The value of an experiment lies precisely in its ability to reveal that the hypothesis was wrong.

The resistance of reality is what gives the process its value.

Reality contradicts. Reality spills beyond categories. Reality produces cases no one anticipated.

A serious experimental protocol is designed to maximize the chances that this resistance can express itself. We test out of distribution. We seek edge cases. We confront the model with different populations, different contexts, and errors unlike those we expected.

When that resistance disappears, science still looks like science. The equations are there. The tables are there. The p-values are there. But something essential has evaporated: the possibility of being disproved by something outside itself.

That is exactly the problem with synthetic data when it is misused.

Act 1: What synthetic data really is

Synthetic data is data generated by an algorithm rather than collected from the real world.

It can take the form of computer-generated images, simulated conversations, agent trajectories, generated medical cases, autonomous-driving scenarios, code examples, or question-and-answer pairs created to train a language model.

And it is often indispensable.

Take AlphaFold, DeepMind’s system for predicting protein structures. Experimental biology does not provide enough resolved protein structures to train a model at scale. Synthetic data (simulated structures and modeled physical constraints) helped bridge that gap. The result was validated against real structures and withstood the comparison. That is the right use: fill a gap in reality, then validate against reality.

Another example is autonomous-vehicle safety. We cannot deliberately cause thousands of fatal accidents to train a system to avoid them. Synthetic scenarios, such as a pedestrian emerging from behind a bus or black ice at the exit of a tunnel, make it possible to explore risk spaces that field data alone could never cover quickly enough.

Synthetic data is a powerful research prosthesis. It lets us move faster, test more broadly, and explore what the real world cannot provide in sufficient quantities.

But a prosthesis is not an organ.

This is where conceptual framing matters: synthetic data is not neutral data. It is a materialized hypothesis.

A synthetic conversation is a hypothesis about what a likely conversation looks like. A synthetic user is a hypothesis about what a user is. A synthetic edge case is a hypothesis about what the researcher imagines an edge case to be.

That is not nothing. It is valuable. But it is not neutral. Confusing the two opens the door to serious problems, some of which are now documented, while others are still barely visible.

Act 2: What we know: model collapse during training

In July 2024, Shumailov et al. published a result in Nature: recursively training a model on synthetic data leads to model collapse.

The mechanism is straightforward: each generation encodes small errors and biases in its outputs. The next generation trains on this slightly degraded data. The tails of the distribution gradually disappear: rare cases, anomalies, and the rough edges of the real world give way to an impoverished, smoothed-out version. In the paper’s iconic example, the process begins with a text about medieval architecture; nine generations later, the model produces a list of jackrabbits.
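The mechanism can be made tangible with a toy simulation. This is a minimal sketch under my own assumptions, not the experiment from the paper: the "model" is a single Gaussian, each generation is refitted only on samples drawn from the previous fit, and the function name and parameters are hypothetical.

```python
import random
import statistics


def simulate_recursive_fitting(generations=300, sample_size=10, seed=0):
    """Refit a Gaussian on its own samples and track the fitted spread."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" distribution
    spreads = [sigma]
    for _ in range(generations):
        # Each generation sees only data sampled from the previous fit.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        spreads.append(sigma)
    return spreads
```

The returned list holds one fitted spread per generation. In this toy setting the spread tends to drift toward zero, which is the one-dimensional version of tails disappearing: small estimation errors compound because nothing outside the loop corrects them.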

This finding has not been disproved; it has been confirmed and developed further. The main nuance is that the mixing strategy is the critical variable. When synthetic data replaces real data at each generation, collapse is inevitable, and this has been proven mathematically. Accumulating data, that is, retaining real data as a permanent anchor, slows it down. But Strong Model Collapse (ICLR 2025) argues that even mixing is insufficient unless the synthetic fraction vanishes asymptotically.
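The difference between replacing and accumulating can be illustrated with a small comparison. Again, this is a sketch under stated assumptions: a Gaussian stands in for the model, a single initial batch stands in for real data, and the names `fit_gaussian` and `run` are mine. It does not reproduce the Strong Model Collapse analysis, which concerns a different and more demanding setting.

```python
import math
import random


def fit_gaussian(data):
    """Return the maximum-likelihood mean and standard deviation."""
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, math.sqrt(variance)


def run(strategy, generations=400, sample_size=20, seed=0):
    """Return the fitted spread after recursive training under a strategy."""
    rng = random.Random(seed)
    # Hypothetical "real" data: one batch drawn from a standard normal.
    pool = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    mu, sigma = fit_gaussian(pool)
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        if strategy == "replace":
            pool = synthetic         # earlier data, real included, is discarded
        elif strategy == "accumulate":
            pool.extend(synthetic)   # real data stays in the pool as an anchor
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        mu, sigma = fit_gaussian(pool)
    return sigma
```

Comparing `run("replace")` with `run("accumulate")` shows the contrast: under replacement the spread shrinks toward zero, while under accumulation it stays of the same order as the original data. The toy says nothing about how much the anchor helps in realistic models; it only shows why the two strategies should not be conflated.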

The conclusion is paradoxical: every mitigation strategy requires a growing supply of fresh real-world data. Real data remains irreplaceable as an anchor. Synthetic data cannot substitute for the world; it can only supplement it, provided it never replaces it.

That is the documented problem. The visible problem.

It concerns training. But synthetic data is used at three distinct levels, with risks of very different kinds:

For training: collapse has been demonstrated. The risk is technical, measurable, and partly mitigable.

For testing: the risk is coverage. A synthetic benchmark covers the paths its generator imagined, not the out-of-distribution cases the real world will produce.

For scientifically validating a thesis: this is where the risk becomes epistemological. And this is where it remains largely undocumented.

Act 3: What we do not know yet: toward epistemic collapse?

The third level, scientific validation, is where I see a risk that is still rarely discussed. I am not a researcher. What follows is not an established diagnosis, but the observation of a pattern that, in my view, deserves attention.

Before describing it, one distinction matters. Artificial environments have always had a place in physics and economics: the vacuum chamber, the perfectly competitive market, Monte Carlo simulations. These models do not describe the real world; they isolate variables that would be impossible to separate from real-world noise. No one claims that a perfect vacuum exists. The artificiality is acknowledged, disclosed, and integrated into the interpretation of the results.

What concerns me in AI is precisely when that distinction fades: when synthetic data is used for validation without its hypothetical nature being clearly stated as such.

When synthetic data is used to validate a thesis, the loop risks closing in on itself. The system is no longer confronted with the real world. It is confronted with an artificial version of the world built from the same assumptions as the system itself. Reality’s resistance is not deliberately removed; it is simply never invited in.

Consider a concrete example: the paper Compiling Agentic Workflows into LLM Weights (arXiv, May 2026). The idea is elegant: instead of running an agentic workflow through an expensive external orchestrator, conversations are generated from that workflow, a small model is fine-tuned on them, and the procedure is “compiled” into its weights.

The paper presents an insurance workflow with 55 nodes and six decision hubs. The complexity is real.

But both the training and evaluation conversations are generated from the flowchart itself.

What exactly is being validated?

The model’s ability to handle real policyholders, with their hesitations, incomplete documents, misunderstandings, and legally ambiguous cases?

Or its ability to imitate the conversational grammar of a simulator built from the workflow?

The distinction is decisive. A simulator is not the world. A flowchart is not a population. Coverage of synthetic paths is not coverage of reality. In this setting, synthetic data does not impoverish the model; it impoverishes the question being asked. It produces clean, readable, convincing results that align perfectly with the initial hypothesis because they were derived from it.
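A deliberately small sketch shows how the two questions can come apart. Everything in it is hypothetical and unrelated to the paper's actual setup: a four-path flowchart, a lookup table standing in for the fine-tuned model, and a hand-written "field" set whose labels I invented for illustration.

```python
from itertools import product

# Hypothetical flowchart inputs: the only cases the generator can imagine.
CLAIM_TYPES = ["theft", "water_damage"]
DOCUMENT_STATES = ["complete", "missing"]


def flowchart(claim_type, documents):
    """Toy decision rule standing in for a workflow."""
    if documents == "missing":
        return "request_documents"
    return "open_claim" if claim_type == "theft" else "send_assessor"


def generate_cases():
    """Enumerate every flowchart path as (input, expected action)."""
    return [((c, d), flowchart(c, d))
            for c, d in product(CLAIM_TYPES, DOCUMENT_STATES)]


def train(cases):
    """'Compile' the cases into a lookup table (stand-in for fine-tuning)."""
    return dict(cases)


def accuracy(model, cases, fallback="open_claim"):
    """Share of cases where the model's action matches the expected one."""
    hits = sum(model.get(x, fallback) == y for x, y in cases)
    return hits / len(cases)


model = train(generate_cases())
synthetic_eval = generate_cases()  # same generator as the training data

# Invented stand-in for field data, including states the flowchart never names.
field_eval = [
    (("theft", "complete"), "open_claim"),
    (("water_damage", "partial"), "request_documents"),
    (("theft", "contradictory"), "escalate_to_human"),
    (("storm", "complete"), "send_assessor"),
]

synthetic_score = accuracy(model, synthetic_eval)  # 1.0
field_score = accuracy(model, field_eval)          # 0.25
```

The perfect synthetic score is true by construction: the evaluation set is the training set's generator run again. Only the second number says anything about the world, and it exists only because someone went outside the flowchart to collect cases.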

I am not saying that AI research is heading straight toward epistemic collapse. That would be precisely the kind of sweeping claim this article criticizes.

What I see is a historical precedent that should make us pay attention. The replication crisis in social psychology (2010–2020) revealed that dozens of studies published in serious journals, with technically sound results, could not be reproduced under different conditions. The root cause was not fraud. It was experimental conditions that were too controlled and too clean, testing the consistency of a protocol rather than the robustness of a phenomenon.

AI research is not social psychology. But it has access to a lever psychology did not have: the ability to generate its own experimental ground. That lever is powerful. It deserves to be handled with explicit awareness of what it can produce.

Act 4: The aggravating factor: a strained dissemination chain

So far, we have discussed methodological problems. Serious problems, but ones that remain within the scientific community, where corrective mechanisms exist: peer review, replication, public criticism.

AI research has one distinctive feature that puts these mechanisms under pressure: its dissemination chain moves at unprecedented speed.

The temporal structure of the problem.

A paper submitted to arXiv is public within 24 hours. Serious peer review takes anywhere from three to eighteen months. Between the two lies a window in which the paper exists as a citable artifact, with the format of a scientific article (equations, ablations, result tables, and references), without having undergone external validation.

In most scientific fields, preprints initially circulate among researchers who have the tools to assess their solidity. In AI, dissemination is immediate and much broader.

The problem is not intent; it is structure.

A content creator who covers AI seriously faces a stream of roughly fifty significant papers per week. Their audience expects regular content. The attention economy they operate in naturally rewards bold claims and spectacular discoveries, and makes nuance, conditions, and “it depends” difficult to foreground.

This is not an individual weakness. It is a structural constraint that produces predictable systemic effects: the abstract becomes the paper. The title becomes the conclusion. The precise conditions under which a result holds disappear during compression.

“Claude feels emotions” circulates within hours. Anthropic’s nuance (internal representations, causal role, no claim of subjectivity) arrives days later, in threads few people read because the first post has already been shared widely.

“LLMs will collapse because of synthetic data” circulates. The accumulation-versus-replacement distinction, the optimal ratio, and the precise conditions of collapse disappear in transmission.

An arXiv preprint acquires the status of established truth before the community has had time to test it seriously.

And the loop closes.

What is structurally concerning is that this dissemination mechanism resembles the model collapse it sometimes describes poorly.

Papers cite preprints that have not yet been validated. Content creators compress and amplify them into simplified claims. Those claims enter the corpus used to train the next generation of LLMs. LLMs reproduce the simplified claims as though they were established because, statistically, they are established in the corpus. Researchers use those LLMs for literature reviews. The next papers begin from those foundations.

This is not yet a demonstrated collapse; it is a risk of drift. But the structure is there: a community that builds and transmits knowledge through successive cycles of compression eventually loses what made that knowledge precise.

The tails of the distribution may disappear. Nuances may fade. Edge cases, conditions, partial refutations: everything that gives an honest scientific result its richness may dissolve with each cycle. It is not inevitable. But it deserves our attention.

What remains to be done

Synthetic data is not a problem. It is a tool. Like any powerful tool, it can be used rigorously or carelessly, and the line between the two is not always visible from the outside.

For every result based on generated data, a few simple questions should be asked:

Is the generator independent of the model being tested?
Has the synthetic data been compared with real-world data?
Have out-of-distribution cases been tested?
Has the system been evaluated on real users, real errors, and real ambiguities?
Do the metrics measure robustness to the world, or conformity to the simulator?
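
These questions can be kept next to a result as a small record. The sketch below is one hypothetical way to do it; the class name, field names, and example answers are my own shorthand for the five questions, not an established tool.

```python
from dataclasses import dataclass, fields


@dataclass
class SyntheticEvidenceChecklist:
    """One boolean per question; True means the answer is yes."""
    generator_independent_of_model: bool
    compared_with_real_data: bool
    out_of_distribution_tested: bool
    evaluated_on_real_users: bool
    metrics_measure_real_robustness: bool

    def open_questions(self):
        """Return the names of the checks that were answered 'no'."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative answers for an imaginary result, not for any real paper.
report = SyntheticEvidenceChecklist(
    generator_independent_of_model=False,
    compared_with_real_data=True,
    out_of_distribution_tested=False,
    evaluated_on_real_users=False,
    metrics_measure_real_robustness=False,
)
```

Calling `report.open_questions()` lists what still stands between the result and a claim about the world. The value is less in the code than in the habit of writing the answers down before reading the results table.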

And one final question for the dissemination chain:

Has this paper been peer-reviewed? How long has it been on arXiv? Under precisely which conditions does the result hold?

The right use of synthetic data is therefore not to replace reality, but to extend its reach. It can explore rare cases, balance distributions, and test hypotheses. But it must remain tied to external, independent, real-world validation.

Science is not a collection of bold claims. It is a process of resistance against the world.

And the world, regularly, has the bad manners not to resemble our hypotheses.

That is exactly why it is worth facing directly.

Synthetic data: from model collapse to epistemic risk