Strong Model Collapse
Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, Julia Kempe
TL;DR
This work demonstrates a strong form of model collapse: even a small fraction of synthetic data in the training set can halt the benefits of scaling, so that larger datasets no longer guarantee better generalization. Using two tractable models (classical linear ridge regression and a random-projections regime), the authors derive a new bias-variance decomposition, E_test ≃ B + V + ζ, where ζ captures the collapse caused by synthetic data. They show that the collapse term does not vanish unless the synthetic data is removed, and that the effect of model size is nuanced: bigger models can worsen collapse before the interpolation threshold but may mitigate it afterward, producing a double-descent phenomenon. Experiments on MNIST and language-model tasks corroborate the theory. The work further analyzes strategic data mixing and finds that single-step mixing is insufficient to prevent collapse, while iterative schemes offer limited practical gains. The results underscore the importance of real-data curation when scaling AI systems and provide a framework for understanding and mitigating synthetic-data-induced degradation in large-scale learning.
Abstract
Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised regression setting and establish the existence of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and feed-forward neural networks for images.
