Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

Joshua Kazdan; Rylan Schaeffer; Apratim Dey; Matthias Gerstgrasser; Rafael Rafailov; David L. Donoho; Sanmi Koyejo

TL;DR

This work analyzes how three training workflows for handling synthetic data affect model performance when models are trained on the outputs of their predecessors. Across three settings, multivariate Gaussian modeling (MGM), kernel density estimation (KDE), and supervised fine-tuning (SFT) of language models, it shows that replacing real data with synthetic samples reliably causes collapse, whereas accumulating synthetic data alongside real data prevents collapse and often stabilizes or improves performance. A fixed-compute variant that trains on a fixed-size subsample of the accumulated data yields intermediate results. The findings also indicate that the value of synthetic data depends on the regime: it helps when real data are scarce but hurts when real data are abundant. They underscore the importance of compute-aware training and data curation. Together, these results offer a pragmatic framework for predicting whether frontier generative models will thrive or falter, and they motivate further mathematical and empirical study of synthetic-data dynamics.

Abstract

What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e., collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training workflows) across three generative-model task settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning) to further confirm the possibility of containment: (a) we confirm that the training workflow of replacing all real data by successive generations of purely synthetic data indeed suffers model collapse in all task settings studied; (b) we consider the training workflow of accumulating synthetic data alongside real data and training on all data combined, and we confirm that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this workflow; (c) we consider a training workflow where real and synthetic data accumulate together but each successive generation of pretraining is constrained to use a fixed-size subset of the data. In this workflow, we observe slow and gradual rather than explosive degradation of test loss across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
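
The three training workflows in (a) to (c) can be illustrated with a toy version of the multivariate Gaussian setting. The sketch below is not the authors' code: the function name `simulate`, the workflow labels, the sample sizes, and the use of the covariance trace as a summary statistic are all assumptions made for illustration. Each generation fits a Gaussian to its training set, samples synthetic data from the fit, and then updates the training set according to the chosen workflow.

```python
import numpy as np

WORKFLOWS = ("replace", "accumulate", "accumulate_subsample")  # labels are ours


def simulate(workflow, n_generations=50, n_per_gen=100, dim=2, seed=0):
    """Fit-and-sample loop for a Gaussian under one data workflow.

    Returns the trace of the fitted covariance at every generation.
    All defaults are illustrative, not values taken from the paper.
    """
    if workflow not in WORKFLOWS:
        raise ValueError(f"unknown workflow: {workflow!r}")

    rng = np.random.default_rng(seed)
    real = rng.standard_normal((n_per_gen, dim))  # "real" data ~ N(0, I)
    pool = real    # everything generated so far, plus the real data
    train = real   # what the next model is fit on
    traces = []

    for _ in range(n_generations):
        # Fit: sample mean and (unbiased) sample covariance.
        mean = train.mean(axis=0)
        cov = np.atleast_2d(np.cov(train, rowvar=False))
        traces.append(float(np.trace(cov)))

        # Generate one batch of synthetic data from the fitted model.
        synthetic = rng.multivariate_normal(mean, cov, size=n_per_gen)

        if workflow == "replace":
            # (a) discard everything; train only on the newest synthetic batch.
            train = synthetic
        else:
            pool = np.vstack([pool, synthetic])
            if workflow == "accumulate":
                # (b) train on all real and synthetic data seen so far.
                train = pool
            else:
                # (c) accumulate, but train on a fixed-size random subset.
                idx = rng.choice(len(pool), size=n_per_gen, replace=False)
                train = pool[idx]

    return traces
```

Calling `simulate(w)` for each `w` in `WORKFLOWS` and plotting the returned traces against the generation index gives a rough picture of how the fitted covariance drifts under each workflow. Exact trajectories depend on the seed and the sample sizes.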
