A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops

Shi Fu, Yingjie Wang, Yuzhu Chen, Xinmei Tian, Dacheng Tao

TL;DR

This paper introduces the notion of recursive stability and presents the first theoretical generalization analysis of Self-consuming Training Loops (STLs), revealing how both model architecture and the proportion of real to synthetic data influence their success. It also extends this analysis to transformers in in-context learning.

Abstract

High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Consequently, models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). However, the empirical results have been strikingly inconsistent: some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding to explain this discrepancy. This paper introduces the intriguing notion of recursive stability and presents the first theoretical generalization analysis, revealing how both model architecture and the proportion between real and synthetic data influence the success of STLs. We further extend this analysis to transformers in in-context learning, showing that even a constant-sized proportion of real data ensures convergence, while also providing insights into optimal synthetic data sizing.

Paper Structure

This paper contains 23 sections, 12 theorems, 126 equations, 1 figure.

Key Result

Theorem 1

Assume that $\mathcal{A}$ is a $\beta_n$-uniformly stable learning algorithm and the loss function $\ell$ is bounded by $M$. Let $n$ represent the sample size of the mixed dataset $\widetilde{S}_j$, defined as $\widetilde{S}_j=\alpha S_0+(1-\alpha) S_j$ for $1 \leq j \leq i$, where $0<\alpha\leq 1$ where $\gamma_n^i= \sup_{j}TV(\mathcal{D}_{i}^{n(1-\alpha)}(S_{0}'),\mathcal{D}_{i}^{n(1-\alpha)}(S

Figures (1)

  • Figure 1: Self-consuming Training Loops: The initial model $\mathcal{G}_0$ is trained on the real dataset $S_0$. For generation $1 \leq j \leq i$, the model $\mathcal{G}_j$ is trained on the mixed dataset $\widetilde{S}_j$.
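The loop described in the Figure 1 caption can be illustrated with a toy simulation. The following is a minimal sketch, not the paper's method: it assumes the "generative model" is a 1-D Gaussian fitted by its sample mean and standard deviation, that the synthetic set $S_j$ is sampled from the previous model $\mathcal{G}_{j-1}$, and that the mixture $\alpha S_0+(1-\alpha) S_j$ is realised by drawing roughly $\alpha n$ real points and $(1-\alpha) n$ synthetic points. The names `fit_gaussian`, `make_mixed_dataset`, and `run_stl` are hypothetical helpers introduced only for this illustration.

```python
import random
import statistics


def fit_gaussian(samples):
    """Fit the toy 'generative model': a 1-D Gaussian (mean, std)."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def make_mixed_dataset(real, synthetic, alpha, n, rng):
    """Build a size-n mixed dataset: round(alpha * n) real points,
    the remainder synthetic (an assumed reading of alpha*S_0 + (1-alpha)*S_j)."""
    n_real = round(alpha * n)
    return rng.sample(real, n_real) + rng.sample(synthetic, n - n_real)


def run_stl(n=500, alpha=0.5, generations=20, seed=0):
    """Run a toy self-consuming training loop and return the fitted
    (mean, std) of every model G_0, G_1, ..., G_generations."""
    rng = random.Random(seed)
    # S_0: real data, here assumed to come from a standard normal.
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean, std = fit_gaussian(real)  # G_0 is trained on S_0 only
    history = [(mean, std)]
    for _ in range(generations):
        # S_j: synthetic data sampled from the previous model (assumption).
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        mixed = make_mixed_dataset(real, synthetic, alpha, n, rng)
        mean, std = fit_gaussian(mixed)  # G_j is trained on the mixed dataset
        history.append((mean, std))
    return history


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 1.0):
        final_mean, final_std = run_stl(alpha=alpha)[-1]
        print(f"alpha={alpha}: mean={final_mean:.3f}, std={final_std:.3f}")
```

With `alpha=1.0` every generation sees only real data, so the fitted parameters stay at those of $\mathcal{G}_0$. Smaller values of `alpha` let the synthetic share grow, which is the regime the paper's analysis is concerned with; this toy model says nothing about the rates in the paper's bounds.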

Theorems & Definitions (30)

  • Remark 1
  • Definition 1
  • Definition 2
  • Theorem 1: General Generalization Bound
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Remark 7
  • ...and 20 more