Universality of the $\pi^2/6$ Pathway in Avoiding Model Collapse

Apratim Dey, David Donoho

TL;DR

The paper addresses the risk of model collapse in iterative training with real and synthetic data, contrasting discard versus augment workflows. It develops a universal theoretical framework based on contiguity and Le Cam's Lemma, showing that under broad exponential-family settings with AAL estimators, the augment workflow yields an asymptotic relative efficiency bounded below by $6/\pi^2 \approx 0.608$, while the discard workflow degrades without bound. A central theorem demonstrates a sequential Gaussian-process limit for model fitting, enabling a unified analysis across diverse models (e.g., linear and logistic) and data-generation schemes. Empirical results on UCI datasets and SSL-derived features corroborate the theory, illustrating faster deterioration under discard and more stable performance under augmentation. The framework provides a practical, simulation-based tool for comparing workflows and offers insight into the fundamental mechanisms that prevent collapse when augmenting with synthetic data.

Abstract

Researchers in empirical machine learning recently spotlighted their fears of so-called Model Collapse. They imagined a discard workflow, where an initial generative model is trained with real data, after which the real data are discarded, and subsequently, the model generates synthetic data on which a new model is trained. They came to the conclusion that models degenerate as model-fitting generations proceed. However, other researchers considered an augment workflow, where the original real data continue to be used in each generation of training, augmented by synthetic data from models fit in all earlier generations. Empirical results on canonical datasets and learning procedures confirmed the occurrence of model collapse under the discard workflow and avoidance of model collapse under the augment workflow. Under the augment workflow, theoretical evidence also confirmed avoidance in particular instances; specifically, Gerstgrasser et al. (2024) found that for classical Linear Regression, test risk at any later generation is bounded by a moderate multiple, viz. pi-squared-over-6 of the test risk of training with the original real data alone. Some commentators questioned the generality of theoretical conclusions based on the generative model assumed in Gerstgrasser et al. (2024): could similar conclusions be reached for other task/model pairings? In this work, we demonstrate the universality of the pi-squared-over-6 augment risk bound across a large family of canonical statistical models, offering key insights into exactly why collapse happens under the discard workflow and is avoided under the augment workflow. In the process, we provide a framework that is able to accommodate a large variety of workflows (beyond discard and augment), thereby enabling an experimenter to judge the comparative merits of multiple different workflows by simulating a simple Gaussian process.
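To see where a constant like $\pi^2/6$ can come from, a toy calculation may help. The setting below is an illustrative assumption and is not taken from the paper: data are drawn from $N(\theta_0, 1)$, the model is fit by the sample mean, there are $n$ real points, and every generation produces $n$ synthetic points from the latest fit. Here $\hat\theta_G$ denotes the generation-$G$ estimate and $\varepsilon_G \sim N(0, 1/n)$ is the sampling noise of the generation-$G$ synthetic batch mean, independent of $\hat\theta_G$.

```latex
% Toy assumption (not from the paper): Gaussian location model, sample-mean fit,
% n real points, n synthetic points per generation, Var(\hat\theta_1) = 1/n.
\begin{align*}
\text{Discard:}\quad
  & \hat\theta_{G+1} = \hat\theta_G + \varepsilon_G
  \;\Longrightarrow\;
  \operatorname{Var}(\hat\theta_G) = \frac{G}{n}, \\
\text{Augment:}\quad
  & \hat\theta_{G+1}
    = \frac{G\,\hat\theta_G + (\hat\theta_G + \varepsilon_G)}{G+1}
    = \hat\theta_G + \frac{\varepsilon_G}{G+1}
  \;\Longrightarrow\;
  \operatorname{Var}(\hat\theta_G) = \frac{1}{n}\sum_{g=1}^{G}\frac{1}{g^2}
  \;\xrightarrow[G\to\infty]{}\; \frac{\pi^2}{6}\cdot\frac{1}{n}.
\end{align*}
```

In this toy case the variance relative to training on real data alone grows linearly in $G$ under discard, while under augment it never exceeds $\pi^2/6 \approx 1.645$, which corresponds to a relative efficiency of at least $6/\pi^2$.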

Paper Structure

This paper contains 51 sections, 6 theorems, 71 equations, 6 figures.

Key Result

Theorem 5.1

Make Assumptions 1, 2 and 3. Define, for any $G\geq 1$, [...]. Consider a reference distribution ${\mathbb{P}}^\texttt{ref}$ that assumes that at each generation $G\geq 1$, the data points (accumulated so far) in ${\mathcal{Z}}_G$ are iid. That is, under ${\mathbb{P}}^\texttt{ref}$, for every $G\geq 1$ and $1\leq i\leq n$, $X_{G,i}$ are iid from $H$ and [...]. Then, the following hold.

Figures (6)

  • Figure 1: Plots showing the ratio of variances of the limit Gaussian variables across generations for three different workflows. The plot on the right is a zoomed-in version of the plot on the left (excluding the curve corresponding to discard). Only the discard workflow exhibits an exploding variance ratio, which grows linearly as predicted by theory. The augment workflow quickly concentrates around $\pi^2/6$. The augment-subsample workflow also seems to plateau, albeit at a higher value.
  • Figure 2: Classification test losses on four datasets over 50 model-training iterations using iteratively fit logistic regression: (in clockwise order) Diabetes, Heart, Titanic, and Wisconsin. Red denotes the curve with discard workflow, whereas blue denotes the corresponding curve with the augment workflow. We can see that the red curves are significantly higher than the blue curves in all the cases. In fact, the blue curves barely seem to increase in comparison to the red curves.
  • Figure 3: Classification test losses with four SSL trained features obtained from ResNet-50 applied on CIFAR-10. Red denotes discard while blue denotes augment. The blue curves clearly increase at a slower rate.
  • Figure 4: Classification test accuracies on four datasets over 50 model-training iterations using iteratively fit logistic regression: (in clockwise order) Diabetes, Heart, Titanic, and Wisconsin. Red denotes the curve with discard workflow, whereas blue denotes the corresponding curve with the augment workflow. We can see that the red curves deteriorate much faster than the blue curves.
  • Figure 5: Classification test losses for six SSL trained features obtained from ResNet-50 applied on CIFAR-10. Red denotes discard while blue denotes augment. The blue curves clearly increase at a slower rate.
  • ...and 1 more figure
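
The qualitative behaviour described for Figure 1 (linear growth under discard, a plateau near $\pi^2/6$ under augment) can be reproduced in a toy setting. The sketch below is not the paper's Gaussian-process simulation: it assumes a unit-variance Gaussian location model fit by the sample mean, equally sized batches at every generation, and units in which the real-data-only variance equals 1. The function names `simulate_variance_ratio` and `exact_augment_ratio` are hypothetical.

```python
import math
import random


def simulate_variance_ratio(workflow, generations, replicates=4000, seed=0):
    """Monte Carlo variance of the generation-G estimate, in units of the
    real-data-only variance (toy Gaussian location model, sample-mean fit)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(replicates):
        # Generation 1: fit on real data only; the scaled error is N(0, 1).
        theta = rng.gauss(0.0, 1.0)
        for g in range(1, generations):
            # Mean of a fresh synthetic batch drawn from the current fit.
            batch_mean = theta + rng.gauss(0.0, 1.0)
            if workflow == "discard":
                # Train only on the newest synthetic batch.
                theta = batch_mean
            elif workflow == "augment":
                # Pool the new batch with the g equally sized batches so far.
                theta = (g * theta + batch_mean) / (g + 1)
            else:
                raise ValueError(f"unknown workflow: {workflow}")
        finals.append(theta)
    mean = sum(finals) / replicates
    return sum((t - mean) ** 2 for t in finals) / (replicates - 1)


def exact_augment_ratio(generations):
    """Closed form for this toy model only: partial sum of 1/g^2."""
    return sum(1.0 / g**2 for g in range(1, generations + 1))


if __name__ == "__main__":
    for G in (1, 5, 10, 30):
        d = simulate_variance_ratio("discard", G)
        a = simulate_variance_ratio("augment", G)
        print(f"G={G:2d}  discard={d:6.2f}  augment={a:5.3f}  "
              f"exact augment={exact_augment_ratio(G):5.3f}")
    print(f"pi^2/6 = {math.pi**2 / 6:.4f}")
```

Under these assumptions the simulated discard ratio should track $G$ up to Monte Carlo noise, while the augment ratio should stay below $\pi^2/6$. Working with batch means rather than individual samples keeps the sketch short; it relies on the sample mean being a sufficient summary in this toy model and would not carry over directly to other estimators.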

Theorems & Definitions (6)

  • Theorem 5.1
  • Lemma 5.2: Discard workflow
  • Lemma 5.3: Augment workflow
  • Lemma 5.4
  • Lemma 5.5
  • Theorem 8.1