Model Collapse Demystified: The Case of Regression

Elvis Dohmatob, Yunzhen Feng, Julia Kempe

TL;DR

This work develops a theoretical framework for model collapse in high-dimensional regression by analyzing iterative retraining on synthetic data generated from previous model generations. Grounded in Gaussian design and extended to kernel Ridge methods, it derives exact test-error formulas and shows how synthetic data induce additional bias and altered scaling laws, particularly under power-law covariance spectra. It introduces adaptive ridge regularization to mitigate collapse, with explicit guidance on choosing regularization exponents, and validates the theory through simulations and MNIST-based kernel experiments. The results reveal fundamental changes to learning dynamics in the presence of self-generated data and offer practical strategies for maintaining performance in synthetic-data regimes. Overall, the paper provides a rigorous, scalable account of how model-generated data can deteriorate learning and how to counteract it through principled regularization.

Abstract

In the era of proliferation of large language and image generation models, the phenomenon of "model collapse" refers to the situation whereby, as a model is trained recursively on data generated from previous generations of itself over time, its performance degrades until the model eventually becomes completely useless, i.e., the model collapses. In this work, we study this phenomenon in the setting of high-dimensional regression and obtain analytic formulae which quantitatively outline this phenomenon in a broad range of regimes. In the special case of polynomially decaying spectral and source conditions, we obtain modified scaling laws which exhibit new crossover phenomena from fast to slow rates. We also propose a simple strategy based on adaptive regularization to mitigate model collapse. Our theoretical results are validated with experiments.

Paper Structure (44 sections, 17 theorems, 100 equations, 5 figures)

Key Result

Theorem 4.1

For an $n$-fold fake data generation process with $T_0 \ge d+2$ samples, the test error for the linear predictor $\widehat{w}_n^{pred}$ given in \ref{eq:ridge} learned on $T \ge d+2$ samples, with $\lambda=0$ (i.e., unregularized), is given by

Figures (5)

  • Figure 1: Demystifying model collapse in ridge regression (isotropic covariance spectrum). We show the evolution of test error for different sample sizes ($T$), different levels of ridge regularization ($\lambda$), and training data from different generations ($n$) of fake data. The setup is: input dimension $d=300$, sample size for fake data generator $T_0=600$, noise levels $\sigma=0.1$ and $\sigma_0 = 0.2$. Left plot is for $T=1000$ and different values of $\lambda$. Notice the U-shape of the curves for large values of $n$, indicating the existence of a sweet spot (optimal regularization parameter). Right plot is for $\lambda = 10^{-3}$ and different values of $T$. Error bars correspond to uncertainty induced by the data-generating process, over different runs. The broken lines correspond to the theoretical result established in Theorem \ref{thm:linreg}.
  • Figure 2: Demystifying model collapse in ridge regression (power-law covariance spectrum). The setup is: $d=300$, $T_0=600$, $\sigma=\sigma_0=1$, $\Sigma=\mathrm{diag}(\lambda_1,\ldots,\lambda_d)$, where $\lambda_k \propto k^{-2}$. Left plot corresponds to $T=10{,}000$ and right plot corresponds to adaptive regularization $\lambda=T^{-\ell_{crit}}$, where $\lambda=\lambda(T)$ as proposed in Cui_2022. See Section \ref{sec:exp} for details. The broken curves are as predicted by our Theorem \ref{thm:darkseid}. Though $\ell=\ell_{crit}$ is optimal in the classical case, it is not optimal in the setup of model collapse. In fact, here the test error diverges with sample size $T$. Our theory proposes a corrected value of this exponent which gracefully adapts to synthesized data.
  • Figure 3: Model collapse in the case of noiseless over-parametrized synthetic data generator. Here $d=300$, the sample sizes for the different versions of the fake data generator are equal, i.e., $T_n=T_0=d/2$ for all $n$, and noise levels are $\sigma_0=0$ and $\sigma=0.1$. Everything else is as in the setting of Figure \ref{fig:linreg}. Broken lines correspond to the theoretical estimates given in Theorem \ref{thm:rho}. As predicted by our theory, the test error of the model fitted on synthetic data ($n \ge 1$) increases (relative to the baseline $n=0$, corresponding to training on clean data). The model collapse here, even in the absence of noise ($\sigma_0=0$), is due to the fact that the synthetic data generator does not have access to enough data to capture the true labelling function. (a) Importantly, and in accordance with our theory, the amount of model collapse in the case $X_n \equiv X_0$ is due to an increase in the bias term of the test error of the model and does not depend on the number of generations $n$ as long as $n \ge 1$. (b) In contrast, for the case where the $X_n$'s are independent, the increase in the bias term grows with $n$, leading to "catastrophic" model collapse (Theorem \ref{thm:cata}).
  • Figure 4: Illustration of the theoretical framework. The process begins with the original model $\widehat{w}_0 (w_0)$ and the original dataset $(X_0, \overline{Y}_0)$. $n$ synthetic data generators $\widehat{w}_1$ to $\widehat{w}_n$ are iteratively fit on data labelled by the previous model with label noise $\sigma_0$, using $T_0$ samples each. We evaluate the test error of $\widehat{w}_n^{pred}$ (with respect to the ground truth labels from $w_0$), which is trained on $(X,Y):=(X_n, \overline{Y}_n)$ using $T$ samples and a regularization coefficient $\lambda$.
  • Figure 5: Demystifying model collapse in kernel ridge regression (power-law covariance spectrum) on MNIST. Here, we use adaptive regularization $T^{-\ell}$ for different values of the exponent $\ell \ge 0$ (see Section \ref{sec:exp} for full experimental setup). Top row: RBF kernel. Bottom row: polynomial kernel. In each plot, we show test error curves as a function of sample size $T$, from different generations ($n$) of fake data. The broken vertical line corresponds to $T=T_0$, where $T_0$ is the number of samples (from the true data distribution) which was used to train the label faker. The regularization exponent $\ell=\ell_{\star}$ (broken curves) is the optimal value in the presence of iterative data relabelling, while $\ell=\ell_{crit}$ (solid curves) corresponds to the optimal value without iterative relabelling (i.e., $n=0$) proposed in Cui_2022 (see \ref{eq:lcrit}). Specifically, we take $\ell_\star=(b-a)\ell_{crit} = b\ell_{crit}$, where $b=\log T_0 / \log T$ (so that $T_0 = T^b$), as proposed in Theorem \ref{thm:darkseid}, formula \ref{eq:lopt}. Notice how the effect of fake data makes the test error become non-decreasing in sample size $T$. This is effectively a collapse of the learned model.
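
The pipeline described in the caption of Figure 4 can be sketched as a small Monte Carlo simulation. This is a minimal illustration under stated assumptions, not the authors' experimental code: the function names (`ridge_fit`, `collapse_test_error`, `adaptive_lambda`), the isotropic Gaussian design, the independence of the design matrices across generations, the use of unregularized least squares for the intermediate generators, the ridge normalization, and the small default sizes are all choices made here for illustration. The test error is measured as the excess risk $\|\widehat{w}_n^{pred} - w_0\|^2$, which is the natural choice when $\Sigma$ is the identity.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge estimate (X^T X / T + lam * I)^{-1} X^T y / T.

    The 1/T normalization is an assumption of this sketch.
    With lam = 0 and T > d this reduces to ordinary least squares.
    """
    T, d = X.shape
    return np.linalg.solve(X.T @ X / T + lam * np.eye(d), X.T @ y / T)


def collapse_test_error(d=20, T0=60, T=100, n=3, sigma=0.1, sigma0=0.2,
                        lam=0.0, trials=300, seed=0):
    """Average excess risk of a predictor trained on n-th generation fake data."""
    rng = np.random.default_rng(seed)
    errors = np.empty(trials)
    for i in range(trials):
        w0 = rng.standard_normal(d) / np.sqrt(d)  # ground-truth weights
        w = w0
        # n generations of the fake-data generator: each one is refit on
        # T0 fresh inputs labelled by the previous generation plus noise sigma0.
        for _ in range(n):
            X_m = rng.standard_normal((T0, d))
            y_m = X_m @ w + sigma0 * rng.standard_normal(T0)
            w = ridge_fit(X_m, y_m, 0.0)  # assumption: generators use plain least squares
        # Downstream predictor: T samples labelled by the last generator,
        # label noise sigma, ridge parameter lam.
        X = rng.standard_normal((T, d))
        y = X @ w + sigma * rng.standard_normal(T)
        w_pred = ridge_fit(X, y, lam)
        # Excess risk with respect to the ground truth w0 (identity covariance).
        errors[i] = np.sum((w_pred - w0) ** 2)
    return errors.mean()


def adaptive_lambda(T, T0, ell_crit):
    """lambda = T^(-ell_star) with ell_star = b * ell_crit and b = log T0 / log T.

    ell_crit is treated as a user-supplied input; its value is not derived here.
    """
    b = np.log(T0) / np.log(T)
    return T ** (-b * ell_crit)


if __name__ == "__main__":
    for n in (0, 1, 3):
        print(f"n={n}: simulated test error = {collapse_test_error(n=n):.4f}")
```

In the unregularized case ($\lambda=0$) with independent isotropic Gaussian designs, the standard inverse-Wishart identity $\mathbb{E}[(X^\top X)^{-1}] = I_d/(T-d-1)$ suggests the simulated value should sit near $\sigma^2 d/(T-d-1) + n\,\sigma_0^2 d/(T_0-d-1)$. This reference value is a sanity check derived for the sketch, not a quotation of the paper's formula. Note also that, since $T^b = T_0$, the exponent $\ell_\star = b\,\ell_{crit}$ from the caption of Figure 5 gives $\lambda = T^{-\ell_\star} = T_0^{-\ell_{crit}}$, so `adaptive_lambda` depends on the generator's sample size $T_0$ rather than on $T$.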

Theorems & Definitions (25)

  • Theorem 4.1
  • Remark 4.2
  • Theorem 4.3
  • Remark 4.4
  • Proposition 4.5
  • Theorem 4.6
  • Corollary 4.7
  • Theorem 4.8
  • Theorem 4.9
  • Remark 5.1
  • ...and 15 more