Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, Sanmi Koyejo

TL;DR

This work analyzes how three data-evolution workflows interact with synthetic data in self-generating settings to affect model performance. By evaluating multivariate Gaussian modeling (MGM), kernel density estimation (KDE), and supervised fine-tuning (SFT) of language models, it demonstrates that replacing data with synthetic samples reliably causes collapse, whereas accumulating real and synthetic data prevents collapse and often stabilizes or improves performance; a fixed-compute subsample variant yields intermediate results. The findings reveal regime-dependent benefits of synthetic data: it is useful when real data are scarce but detrimental when real data are abundant. They also underscore the importance of compute-aware training and data curation. Together, these results offer a pragmatic framework for predicting whether frontier generative models will thrive or falter, and they motivate further mathematical and empirical study of synthetic-data dynamics.

Abstract

What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e., collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows) across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning) to further confirm the possibility of containment: (a) we confirm that the training-workflow of replacing all real data with successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of accumulating synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually approaches zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but each successive generation of pretraining is constrained to use a fixed-size data subset. In this workflow, we observe slow and gradual rather than explosive degradation of test loss across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
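
The three training-workflows can be illustrated with a minimal sketch in the multivariate Gaussian setting. This is not the authors' code: the function names, the sample size `n`, the number of iterations, and the choice of the covariance trace as a summary statistic are all assumptions made for illustration. Each iteration fits a mean and covariance to the current training set, samples `n` synthetic points from the fitted Gaussian, and then builds the next training set according to the chosen workflow.

```python
import numpy as np


def fit_gaussian(data):
    """Return the sample mean and sample covariance of the rows of `data`."""
    return data.mean(axis=0), np.cov(data, rowvar=False)


def run_workflow(workflow, n=100, dim=2, iterations=100, seed=0):
    """Simulate one workflow; return the trace of the fitted covariance per iteration.

    `workflow` is one of "replace", "accumulate", "accumulate-subsample".
    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    real = rng.standard_normal((n, dim))  # real data drawn from N(0, I)
    pool = real.copy()                    # all data seen so far (real + synthetic)
    train = real                          # data the current model is fitted on
    traces = []
    for _ in range(iterations):
        mu, cov = fit_gaussian(train)
        traces.append(float(np.trace(cov)))
        synthetic = rng.multivariate_normal(mu, cov, size=n)
        if workflow == "replace":
            # Discard everything and train only on the newest synthetic data.
            train = synthetic
        elif workflow == "accumulate":
            # Keep all past data and train on the full, growing pool.
            pool = np.vstack([pool, synthetic])
            train = pool
        elif workflow == "accumulate-subsample":
            # Keep all past data but train on a fixed-size random subset.
            pool = np.vstack([pool, synthetic])
            idx = rng.choice(len(pool), size=n, replace=False)
            train = pool[idx]
        else:
            raise ValueError(f"unknown workflow: {workflow}")
    return traces


if __name__ == "__main__":
    for name in ("replace", "accumulate", "accumulate-subsample"):
        trace = run_workflow(name, n=10, iterations=200)
        print(f"{name:>22}: first={trace[0]:.3f} last={trace[-1]:.3g}")
```

Tracking the covariance trace is only one convenient way to see whether the fitted distribution shrinks; the paper itself reports test loss on real data, which this sketch does not compute.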

Paper Structure

This paper contains 18 sections, 7 theorems, 33 equations, 20 figures.

Key Result

Theorem 1

For notational efficiency, for a univariate Gaussian, let $\hat{\mu}^{(t)}$ and $\hat{\sigma}^{(t)}$ denote $\hat{\mu}^{(t)}_\textrm{Accumulate}$ and $\hat{\Sigma}^{(t)}_{\textrm{Accumulate}}$. Then

Figures (20)

  • Figure 1: Model Collapse in Multivariate Gaussian Modeling. Top: Previous work (Shumailov et al., 2023; Alemohammad et al., 2023; Bertrand et al., 2023) proved that model collapse occurs under the replace training-workflow, which iteratively fits means and covariances to data, deletes earlier data, and replaces it with samples from a Gaussian with the fitted parameters (left). However, under the accumulate workflow, where data are not deleted after each model-fitting iteration, model collapse does not occur (right). Note: We visualize the fitted Gaussians as zero-mean for easy comparison of the fitted covariances across model-fitting iterations. Middle: If data are replaced, then the fitted means drift away from the original data's mean, but if data instead accumulate, then the fitted means stabilize. Bottom: If data are replaced, then the fitted covariances collapse compared to the original data's covariance, but if past data are not discarded, the fitted covariances stabilize quickly and collapse is averted.
  • Figure 2: Model Collapse in Kernel Density Estimation. Left: We consider 4 standard datasets from sklearn: Blobs, Circles, Moons and Swiss Roll. Center: For all four datasets, deleting data en masse causes the negative log likelihoods (NLL) of held-out real data to increase with each model-fitting iteration. Right: For all four datasets, the accumulate training-workflow avoids diverging test loss on real data. Interestingly, for specific pairs of datasets and numbers of samples per iteration, training on real and accumulated synthetic data can yield lower test-loss on held-out real data than would training on real data alone.
  • Figure 3: Model Collapse in Supervised Fine-tuning of Language Models. Fine-tuning Google's Gemma 2 models on Nvidia's HelpSteer 2 dataset demonstrates that model collapse occurs if previous data are replaced after each model-fitting iteration (left), whereas model collapse is avoided if new synthetic data instead accumulate with previous real and synthetic data (right).
  • Figure 4: Model Collapse Under a Fixed Compute Budget. We compare deleting data after each model-fitting iteration (replace) and accumulating data after each iteration (accumulate) with a new fixed-compute data paradigm, accumulate-subsample. In accumulate-subsample, real and synthetic data accumulate but are then subsampled so that each model is trained on a constant number of data points. Accumulate-subsample's test loss on real data deteriorates more quickly than accumulate's loss but more slowly than replace's loss, and it frequently converges, albeit to a higher plateau than accumulate. These results hold across five task-settings: multivariate Gaussian modeling, language model instruction fine-tuning, kernel density estimation, linear regression, and language model pretraining.
  • Figure 5: The Value of Synthetic Data in Supervised Fine-tuning of Language Models. We fine-tune Google's Gemma 2 2B on Nvidia's HelpSteer 2 dataset using different combinations of real and synthetic data. We observe that when the amount of real data is small, supplementing it with synthetic data can improve test loss. With sufficient real data, adding synthetic data degrades performance.
  • ...and 15 more figures
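
The kernel density estimation comparison described in Figure 2 can be sketched in a similar way. This is a rough illustration under several assumptions, not the authors' setup: `two_arcs` is a hypothetical hand-rolled stand-in for a Moons-style dataset (the paper uses sklearn's datasets), the Gaussian kernel bandwidth is held fixed at an arbitrary value, and the KDE is implemented directly in NumPy. Test loss is the mean negative log-likelihood (NLL) of held-out real data.

```python
import numpy as np


def two_arcs(n, rng, noise=0.1):
    """Hypothetical Moons-style dataset: two interleaved half-circles plus noise."""
    t = rng.uniform(0.0, np.pi, size=n)
    upper = rng.random(n) < 0.5
    x = np.where(upper, np.cos(t), 1.0 - np.cos(t))
    y = np.where(upper, np.sin(t), 0.5 - np.sin(t))
    return np.column_stack([x, y]) + noise * rng.standard_normal((n, 2))


def kde_sample(train, n, bandwidth, rng):
    """Sample from a Gaussian KDE: pick a training point, then add kernel noise."""
    centers = train[rng.integers(len(train), size=n)]
    return centers + bandwidth * rng.standard_normal(centers.shape)


def kde_nll(train, test, bandwidth):
    """Mean negative log-likelihood of `test` under a Gaussian KDE fitted to `train`."""
    d = train.shape[1]
    sq_dist = ((test[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    log_kernel = -0.5 * sq_dist / bandwidth**2 - 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    # Numerically stable log-mean-exp over the training points.
    m = log_kernel.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_kernel - m).sum(axis=1)) - np.log(len(train))
    return float(-log_p.mean())


def run_kde(workflow, n=200, iterations=30, bandwidth=0.2, seed=0):
    """Return the held-out NLL per iteration for "replace" or "accumulate"."""
    rng = np.random.default_rng(seed)
    real, test = two_arcs(n, rng), two_arcs(n, rng)
    train, losses = real, []
    for _ in range(iterations):
        losses.append(kde_nll(train, test, bandwidth))
        synthetic = kde_sample(train, n, bandwidth, rng)
        if workflow == "replace":
            train = synthetic                        # discard all earlier data
        elif workflow == "accumulate":
            train = np.vstack([train, synthetic])    # keep all earlier data
        else:
            raise ValueError(f"unknown workflow: {workflow}")
    return losses


if __name__ == "__main__":
    for name in ("replace", "accumulate"):
        losses = run_kde(name)
        print(f"{name:>10}: first NLL={losses[0]:.3f} last NLL={losses[-1]:.3f}")
```

With a fixed bandwidth, each replace iteration adds another layer of kernel noise to the whole training set, which is one simple mechanism by which held-out NLL can grow; how the bandwidth is chosen in the paper's experiments is not stated in this summary.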

Theorems & Definitions (13)

  • Theorem 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Corollary 4
  • proof
  • Theorem 5
  • proof
  • Theorem 6
  • ...and 3 more