Strong Model Collapse

Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, Julia Kempe

TL;DR

This work demonstrates a strong form of model collapse: even a small fraction of synthetic data in training can halt the benefits of scaling, as larger datasets do not guarantee better generalization. By developing two tractable models—the classical linear ridge and a random projections regime—the authors derive a new bias-variance decomposition, E_test ≃ B + V + ζ, where ζ captures collapse due to synthetic data. They show that the collapse term remains nonvanishing unless synthetic data is removed, and that model size effects are nuanced: bigger models can worsen collapse before interpolation but may mitigate it afterward, producing a double-descent phenomenon. Experimental validation across MNIST and language-model tasks corroborates the theory, and the work further analyzes attempts at strategic data mixing, finding single-step mixing insufficient to prevent collapse, with iterative schemes offering limited practical gains. The results underscore the importance of real data curation in scaling AI systems and provide a framework for understanding and mitigating synthetic-data-induced degradation in large-scale learning.

Abstract

Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised regression setting and establish the existence of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and feed-forward neural networks for images.
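
The claim that larger training sets stop helping once a fixed fraction of the data is synthetic can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes isotropic Gaussian features, a synthetic labelling function that differs from the true one by a vector of norm `c` (a simplified stand-in for the paper's quality parameter), and a near-zero ridge penalty. The function name `excess_risk` and all parameter names are hypothetical. Under these assumptions the fitted weights drift toward a blend of the real and synthetic labelling functions, so the error on real data should level off near `p2**2 * c**2` rather than decay with `n`.

```python
import numpy as np

def excess_risk(n, d=20, p2=0.1, c=1.0, sigma=1.0, lam=1e-6, seed=0):
    """Excess test error (on the real distribution) of ridge regression
    trained on n samples, a fraction p2 of which carries synthetic labels.

    Assumptions (not taken from the paper): isotropic Gaussian features,
    and a synthetic labelling vector at Euclidean distance c from the true one.
    """
    rng = np.random.default_rng(seed)
    w_real = rng.standard_normal(d) / np.sqrt(d)   # true labelling function
    delta = rng.standard_normal(d)
    delta *= c / np.linalg.norm(delta)             # ||w_synth - w_real|| = c
    w_synth = w_real + delta

    n2 = int(round(p2 * n))                        # number of synthetic samples
    n1 = n - n2                                    # number of real samples
    X = rng.standard_normal((n, d))
    y = np.empty(n)
    y[:n1] = X[:n1] @ w_real                       # real labels
    y[n1:] = X[n1:] @ w_synth                      # synthetic (mislabelled) part
    y += sigma * rng.standard_normal(n)            # label noise on all samples

    # Ridge estimator: (X^T X / n + lam I)^{-1} X^T y / n
    w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

    # For isotropic test features, E[(x.w_hat - x.w_real)^2] = ||w_hat - w_real||^2
    return float(np.sum((w_hat - w_real) ** 2))

if __name__ == "__main__":
    for n in (200, 2000, 20000):
        real_only = excess_risk(n, p2=0.0)
        mixed = excess_risk(n, p2=0.1)
        print(f"n={n:6d}  real-only={real_only:.4f}  10% synthetic={mixed:.4f}")
```

Running the loop with a larger `c` or `p2` raises the plateau, while setting `p2=0.0` recovers the usual decay of the error with `n`.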

Paper Structure (70 sections, 12 theorems, 127 equations, 10 figures)

This paper contains 70 sections, 12 theorems, 127 equations, 10 figures.

Key Result

Theorem 1

Define $\sigma^2 := p_1 \sigma_1^2 + p_2 \sigma_2^2$ and let $\kappa,u \ge 0$ be as previously constructed. In the proportionate scaling limit (eq:proportionate), the test error w.r.t. the true data distribution $P_1$ of the classical linear model $\widehat{f}_{CL}$ defined in (eq:ridge) is given by $E_

Figures (10)

  • Figure 1: Pareto diagram: Understanding the role of model size in model collapse. We compare the test error (on the real / true data distribution) of a random projections model (equation \ref{eq:ridge-randproj} of Section \ref{sec:models}) when training is done on a mix of synthetic and real data (y-axis) versus real data only (x-axis); in both cases, the total amount of training data is fixed to $n=500$. On the scatter plots, square points correspond to very high-quality synthetic data (i.e., from a distribution which is close to the true data distribution), diamonds correspond to high-quality synthetic data, triangles correspond to low-quality synthetic data, and stars correspond to very low-quality synthetic data. The black lines correspond to the Pareto frontiers for each level of quality of the synthetic data; the higher the frontier above the diagonal in the given setting, the more serious the model collapse. The colorbar is the log of the parametrization rate $\psi = m/n$, where $m$ captures the size of the model.
  • Figure 2: Illustration of our new bias-variance decomposition $E_{test} \simeq B + V + \zeta$ for neural networks in the simplified random projections regime (cf. Section \ref{sec:nn_theory}), trained on a mixture of real and synthetic data. The sum $B + V$ corresponds to the classical bias-variance decomposition in this setup when all the training data is real. The extra term $\zeta$ is responsible for model collapse when training is done on a mixture of real and synthetic data. The scalar $c^2$ characterizes the quality of the synthetic data (cf. Definition \ref{df:quality}), via its mismatch with the real data distribution. The vertical line corresponds to the interpolation threshold $m=n$, where $m$ is the model size and $n$ is the total sample size. Notice the well-known double-descent curve in the bias curve.
  • Figure 3: Strong model collapse in the classical linear model (empirical confirmation of Corollary \ref{cor:GBU}). The training dataset comprises $n=n_1+n_2$ samples: a mixture of $n_2=p_2n$ synthetic samples and $n_1=n-n_2$ real samples. The real samples are from the same distribution as the real / true samples of the training dataset, while the synthetic samples are from a distribution with the same covariance structure and label noise level $\sigma=1$, but an incorrect labelling function (epistemic error). The quality of the synthetic data is controlled by the scalar $c$, with $c \to 0$ corresponding to synthetic data of perfect quality (higher values correspond to lower-quality synthetic data). Solid curves correspond to experiments, and broken curves correspond to our theoretical predictions of Corollary \ref{cor:GBU}; notice the perfect match. We see that even a small amount of low-quality synthetic data is enough to cause model collapse, whereby the test error of the model deviates from a perfect diagonal (ideal scaling law, corresponding to $p_2=0$, i.e., training on real data only).
  • Figure 4: Impact of model size (network width $m$) on model collapse. For various levels of data quality $c^2$ (cf. Definition \ref{df:quality}) we show test error as a function of model size for various mixing ratios (darker curves correspond to higher fractions $p_2$ of synthetic data). Error bars correspond to 5 independent runs.
  • Figure 5: Impact of model size (network width $m$) on model collapse. As usual, solid curves correspond to experimental results (5 runs), while broken curves correspond to predictions of our theory (here, Corollary \ref{cor:lin-overparam}). Error bars correspond to 5 independent runs. Also see Figures \ref{fig:bvzeta} and \ref{fig:like_fig_4}.
  • ...and 5 more figures
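
The model-size experiments described in the captions above (test error as a function of width $m$, with parametrization rate $\psi = m/n$) can be mimicked with a toy random projections model. This is a minimal sketch under stated assumptions, not the paper's setup: features are isotropic Gaussian, the projection `S` is a Gaussian matrix with entries of variance `1/d`, ridge regression is fit on the projected features, and the synthetic labelling vector sits at distance `c` from the true one. The function name `rp_test_error` and its parameters are hypothetical.

```python
import numpy as np

def rp_test_error(m, n=500, d=100, p2=0.2, c=1.0, sigma=0.5, lam=1e-3, seed=0):
    """Excess test error on real data of ridge regression fit on m random
    projections of the inputs, with a fraction p2 of synthetic labels.

    Assumed setup (illustrative only): x ~ N(0, I_d), features z = S^T x
    with S a d x m Gaussian matrix, and ||w_synth - w_real|| = c.
    """
    rng = np.random.default_rng(seed)
    w_real = rng.standard_normal(d) / np.sqrt(d)
    delta = rng.standard_normal(d)
    delta *= c / np.linalg.norm(delta)

    n2 = int(round(p2 * n))                          # synthetic sample count
    n1 = n - n2                                      # real sample count
    X = rng.standard_normal((n, d))
    y = np.concatenate([X[:n1] @ w_real, X[n1:] @ (w_real + delta)])
    y += sigma * rng.standard_normal(n)              # label noise

    S = rng.standard_normal((d, m)) / np.sqrt(d)     # random projection, d x m
    Z = X @ S                                        # projected features, n x m
    v = np.linalg.solve(Z.T @ Z / n + lam * np.eye(m), Z.T @ y / n)

    # The fitted predictor is x -> x^T (S v); compare S v with w_real.
    w_eff = S @ v
    return float(np.sum((w_eff - w_real) ** 2))

if __name__ == "__main__":
    n = 500
    for m in (10, 50, 100, 250, 500, 1000):
        real_only = rp_test_error(m, n=n, p2=0.0)
        mixed = rp_test_error(m, n=n, p2=0.2)
        print(f"psi=m/n={m / n:5.2f}  real-only={real_only:.4f}  mixed={mixed:.4f}")
```

Because both calls share a seed, the real-only and mixed runs use the same inputs and projection, so the gap between the two printed columns reflects only the synthetic labels. Plotting the mixed error against the real-only error for several values of `c` gives a rough analogue of the Pareto diagram in Figure 1.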

Theorems & Definitions (20)

  • Remark 1
  • Definition 1
  • Theorem 1
  • Corollary 1
  • Remark 2
  • Definition 2
  • Theorem 2
  • Corollary 2
  • Corollary 3
  • Corollary 4
  • ...and 10 more