Position: Model Collapse Does Not Mean What You Think

Rylan Schaeffer, Joshua Kazdan, Alvan Caleb Arulandu, Sanmi Koyejo

TL;DR

The paper argues that the discourse on model collapse is distorted by eight conflicting definitions and unrealistic modeling assumptions. It reframes the problem under realistic pretraining conditions with accumulating real and synthetic data, showing that many catastrophic collapse claims rely on implausible setups. It concludes that population risk is unlikely to diverge in practice, though tail-data loss and shifts in scaling laws deserve careful study, and it proposes a standardized 'model collapse profile' to unify future work. The work advocates clearer definitions and best practices to redirect research toward the real harms and mitigations in synthetic-data regimes.

Abstract

The proliferation of AI-generated content online has fueled concerns over model collapse, a degradation in future generative models' performance when trained on synthetic data generated by earlier models. Industry leaders, premier research journals and popular science publications alike have prophesied catastrophic societal consequences stemming from model collapse. In this position piece, we contend this widespread narrative fundamentally misunderstands the scientific evidence. We highlight that research on model collapse actually encompasses eight distinct and at times conflicting definitions of model collapse, and argue that inconsistent terminology within and between papers has hindered building a comprehensive understanding of model collapse. To assess how significantly different interpretations of model collapse threaten future generative models, we posit what we believe are realistic conditions for studying model collapse and then conduct a rigorous assessment of the literature's methodologies through this lens. While we leave room for reasonable disagreement, our analysis of research studies, weighted by how faithfully each study matches real-world conditions, leads us to conclude that certain predicted claims of model collapse rely on assumptions and conditions that poorly match real-world conditions, and in fact several prominent collapse scenarios are readily avoidable. Altogether, this position paper argues that model collapse has been warped from a nuanced multifaceted consideration into an oversimplified threat, and that the evidence suggests specific harms more likely under society's current trajectory have received disproportionately less attention.

Paper Structure

This paper contains 14 sections, 12 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Model Collapse Has Been Defined in Multiple and Sometimes Conflicting Ways. By hand-annotating 28 prior research publications, we identify 8 definitions of model collapse (see the paper's section on multiple definitions). The 8 definitions can be loosely grouped into three families: (1) the behavior of the test loss on real data over model-fitting iterations (left), (2) the deformation of the real data distribution over model-fitting iterations (center) and (3) the scaling behavior of the test loss with respect to typical scaling quantities such as the amount of data (right).
  • Figure 2: Model Collapse has been defined in multiple and sometimes conflicting ways. We conduct a meta-analysis of research papers on model collapse. Top: We identify which papers offer any explicit definition of model collapse (Yes (Y) or No (N)), broadly construed. Bottom: We identify which definition(s) of model collapse each paper uses for its experimental and/or mathematical results, either explicitly (E) or implicitly (I). Our annotations reveal that research on model collapse is based on multiple definitions that we will show sometimes conflict between papers and even within individual papers.
  • Figure 3: Dimensions of Consideration for Model-Data Feedback Loops: Propagation of Data Over Time and Proportion of Real Data Over Time. When data are replaced after each model-fitting iteration (left), the proportion of real data immediately becomes zero after the first iteration, whereas when data instead accumulate (right), the proportion of real data falls asymptotically to zero. Gerstgrasser et al. (2024), Kazdan et al. (2024), and Dey & Donoho (2024) showed that replacing data over time causes the population risk to diverge, whereas accumulating data avoids diverging population risk. In these works, synthetic data are assumed to grow linearly over time, contributing $n$ samples per model-sampling iteration. Credit: The bottom figure is copied from Gerstgrasser et al. (2024) with permission.
  • Figure 4: Dimension of Consideration for Model-Data Feedback Loops: Timescales of Collapse. Characterizing the timescale over which one should expect collapse is an underappreciated but crucial consideration. Focusing on the discrete model of Shumailov et al. (2023), the expected number of model-fitting iterations before total collapse is proportional to the number of data points times the entropy of the initial data distribution (left). Taking this model at face value, this means that trillions of models can be trained before glimpsing the onset of collapse. However, total collapse is only the most extreme outcome; in this model, we additionally show how the entropy of the initial data distribution decays over time (right). Error bars are over 100 seeds (0 to 99, inclusive); for experimental details, see the paper's subsection on timescales.
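
The contrast drawn in the Figure 3 caption between replacing and accumulating data can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, iteration 0 is assumed to be the model fit on real data alone, and the real dataset is assumed to have the same size n as each synthetic batch. Under those assumptions, the real-data share is 0 from iteration 1 onward when data are replaced, and 1/(1 + t) at iteration t when data accumulate.

```python
from fractions import Fraction


def real_fraction_replace(t: int) -> Fraction:
    """Share of real data at iteration t when each iteration's data
    replace the previous data entirely (hypothetical helper)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    # Assumption: iteration 0 trains on real data only; afterwards the
    # training set consists solely of the newest synthetic samples.
    return Fraction(1) if t == 0 else Fraction(0)


def real_fraction_accumulate(t: int, n_real: int, n_synth_per_iter: int) -> Fraction:
    """Share of real data at iteration t when synthetic data are added
    to everything collected so far (hypothetical helper)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    # Synthetic data grow linearly: n_synth_per_iter new samples per iteration.
    return Fraction(n_real, n_real + t * n_synth_per_iter)


if __name__ == "__main__":
    n = 1000  # assumed size of the real dataset and of each synthetic batch
    for t in range(6):
        print(t, real_fraction_replace(t), real_fraction_accumulate(t, n, n))
```

With equal batch sizes the accumulating share falls as 1, 1/2, 1/3, and so on: it approaches zero but never reaches it in finite time, which is the "falls asymptotically to zero" behavior the caption describes.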