A Probabilistic Perspective on Model Collapse

Shirong Xu, Hengzhi He, Guang Cheng

TL;DR

The paper presents a probabilistic framework for model collapse in recursive training with synthetic data, modeling the parameter trajectory as a random walk. It identifies a critical role for the synthetic data schedule, showing that superlinear growth $O(t^{1+s})$ (and faster under bias) is often needed to prevent collapse, and provides a general expression for the probability that recursive synthetic training improves over real-data training under Gaussian and asymptotically normal estimators. It extends these results to general parametric families and validates them with simulations and a real tabular-data study, offering principled data-expansion guidelines that balance computation and model utility. The work contributes actionable insights into when and how synthetic data can help or hinder learning in iterative generative-model training and provides a rigorous basis for designing expansion schedules in practice.

Abstract

In recent years, model collapse has become a critical issue in language model training, making it essential to understand the underlying mechanisms driving this phenomenon. In this paper, we investigate recursive parametric model training from a probabilistic perspective, aiming to characterize the conditions under which model collapse occurs and, crucially, how it can be mitigated. We conceptualize the recursive training process as a random walk of the model estimate, highlighting how the sample size influences the step size and how the estimation procedure determines the direction and potential bias of the random walk. Under mild conditions, we rigorously show that progressively increasing the sample size at each training step is necessary to prevent model collapse. In particular, when the estimation is unbiased, the required growth rate follows a superlinear pattern. This rate needs to be accelerated even further in the presence of substantial estimation bias. Building on this probabilistic framework, we also investigate the probability that recursive training on synthetic data yields models that outperform those trained solely on real data. Moreover, we extend these results to general parametric model families in an asymptotic regime. Finally, we validate our theoretical results through extensive simulations and experiments on a real-world dataset.
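
To make the random-walk picture concrete, here is a worked special case. It is an illustrative assumption, not the paper's general result: suppose the model is Gaussian with known variance $\sigma^2$, the estimator is the sample mean, and step $t$ draws $n_t$ synthetic samples from the previous fit. Each step then adds an independent Gaussian increment whose size shrinks with $n_t$:

$$\widehat{\mu}_t = \widehat{\mu}_{t-1} + \frac{\sigma}{\sqrt{n_t}}\, Z_t, \qquad Z_t \overset{\text{i.i.d.}}{\sim} N(0, 1),$$

$$\mathrm{Var}\big(\widehat{\mu}_T - \widehat{\mu}_0\big) = \sigma^2 \sum_{t=1}^{T} \frac{1}{n_t}.$$

With a constant schedule $n_t = n$ the accumulated variance is $\sigma^2 T / n$, which grows without bound. With a linear schedule $n_t = n t$ it is $(\sigma^2 / n) \sum_{t=1}^{T} 1/t$, which still diverges like $\log T$. With $n_t = n t^{1+s}$ for some $s > 0$ the series $\sum_{t \ge 1} t^{-(1+s)}$ converges, so the drift stays bounded for all $T$. In this simplified setting, that is where the superlinear threshold comes from.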

Paper Structure

This paper contains 20 sections, 226 equations, 11 figures.

Figures (11)

  • Figure 1: Model Collapse in Recursive Training Framework (shumailov2024ai).
  • Figure 2: A General Framework for Recursive Training with Fully Synthetic Data
  • Figure 3: Experimental Setup for Recursive Gaussian Estimation: We fix parameters $(n, \mu, \sigma^2) = (100, 0, 1)$ and vary $T \in \{100, 200, 300, 400, 500\}$. For each $T$, we conduct $10^4$ replications, recording the estimate $\widehat{\sigma}_{T,i}^2$ for each replication. We then report the percentage of replications with $\widehat{\sigma}_{T,i}^2 \leq 0.05$, the maximum value $\max_{i} \widehat{\sigma}_{T,i}^2$, and the estimated population risk across all replications.
  • Figure 4: An illustration of recursive training represented as a random walk.
  • Figure 5: An illustration of recursive training represented as a random walk.
  • ...and 6 more figures
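
The setup described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the function name `recursive_gaussian_variance` is hypothetical, the variance estimator is assumed to be the unbiased sample variance (the caption does not say which estimator is used), the first step is assumed to sample from the true distribution, and the default number of replications is smaller than the $10^4$ in the caption to keep the run short.

```python
import numpy as np


def recursive_gaussian_variance(n=100, mu=0.0, sigma2=1.0, T=100, reps=1000, seed=0):
    """Return the variance estimates after T recursive refits, one per replication.

    Assumptions (not taken from the paper): the unbiased sample variance is
    used, and step 1 draws from the true N(mu, sigma2).
    """
    rng = np.random.default_rng(seed)
    mu_hat = np.full(reps, mu, dtype=float)
    var_hat = np.full(reps, sigma2, dtype=float)
    for _ in range(T):
        # Draw n samples per replication from the current fitted Gaussian.
        samples = rng.normal(
            loc=mu_hat[:, None],
            scale=np.sqrt(var_hat)[:, None],
            size=(reps, n),
        )
        # Refit the Gaussian on the newly generated samples only.
        mu_hat = samples.mean(axis=1)
        var_hat = samples.var(axis=1, ddof=1)
    return var_hat


if __name__ == "__main__":
    for T in (100, 200, 300):
        v = recursive_gaussian_variance(T=T)
        # Two of the summaries named in the caption: share of replications
        # with variance estimate <= 0.05, and the largest estimate.
        print(T, np.mean(v <= 0.05), v.max())
```

Replacing the fixed `n` with a step-dependent sample size is the natural way to probe the expansion schedules discussed in the TL;DR, at the cost of larger arrays at later steps.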