Model Collapse Demystified: The Case of Regression
Elvis Dohmatob, Yunzhen Feng, Julia Kempe
TL;DR
This work develops a theoretical framework for model collapse in high-dimensional regression by analyzing iterative retraining on synthetic data generated by previous model generations. Starting from a Gaussian design and extending to kernel ridge regression, it derives exact test-error formulas and shows how synthetic data induce additional bias and alter scaling laws, particularly under power-law covariance spectra. It introduces adaptive ridge regularization to mitigate collapse, with explicit guidance on choosing regularization exponents, and validates the theory through simulations and MNIST-based kernel experiments. The results show how self-generated data change learning dynamics and offer practical strategies for maintaining performance in synthetic-data regimes.
Abstract
In the era of proliferating large language and image generation models, "model collapse" refers to the situation in which a model trained recursively on data generated by previous generations of itself degrades in performance until it eventually becomes completely useless, i.e., the model collapses. In this work, we study this phenomenon in the setting of high-dimensional regression and obtain analytic formulae which quantitatively describe it in a broad range of regimes. In the special case of polynomially decaying spectral and source conditions, we obtain modified scaling laws which exhibit new crossover phenomena from fast to slow rates. We also propose a simple strategy based on adaptive regularization to mitigate model collapse. Our theoretical results are validated with experiments.
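The recursive-training loop described above can be illustrated with a minimal simulation. This is a sketch under simplifying assumptions, not the paper's experimental code: the function name `simulate_collapse`, the isotropic Gaussian design, and all parameter values are hypothetical choices made for illustration. Each generation fits ridge regression on fresh inputs whose labels come from the previous generation's model, and the error is always measured against the original ground-truth weights.

```python
import numpy as np


def simulate_collapse(n_generations=10, n_samples=200, dim=20,
                      noise_std=0.5, ridge=1e-3, seed=0):
    """Toy recursive ridge regression (illustrative assumptions only).

    Returns the squared parameter error ||w_n - w_true||^2 after each
    generation. With identity covariance, as assumed here, this equals
    the excess test risk relative to the ground-truth model.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim) / np.sqrt(dim)  # ground-truth weights
    w_prev = w_true.copy()                        # labeler for generation 1
    errors = []
    for _ in range(n_generations):
        # Fresh Gaussian inputs; labels are produced by the previous model.
        X = rng.normal(size=(n_samples, dim))
        y = X @ w_prev + noise_std * rng.normal(size=n_samples)
        # Ridge estimator: solve (X^T X + ridge * n * I) w = X^T y.
        gram = X.T @ X + ridge * n_samples * np.eye(dim)
        w_hat = np.linalg.solve(gram, X.T @ y)
        errors.append(float(np.sum((w_hat - w_true) ** 2)))
        w_prev = w_hat                            # next generation's labeler
    return errors


if __name__ == "__main__":
    for generation, err in enumerate(simulate_collapse(), start=1):
        print(f"generation {generation}: error = {err:.4f}")
```

In this toy setting each generation inherits the estimation error of its predecessor and adds fresh label noise of its own, so the error measured against the ground truth tends to grow with the number of generations. The fixed `ridge` value is only a placeholder; the adaptive regularization studied in the paper is not implemented here.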
