From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources

Soham Bakshi, Sunrit Chakraborty

TL;DR

The paper analyzes the long-term dynamics of iterative training when data come from a mixture of the true distribution and synthetic sources, formalizing a population-level framework with $P_{t+1}=\alpha_{t+1}P^*+(1-\alpha_{t+1})\hat{P}_t$. It derives the exact error evolution for a simplified multinomial/next-token setting via $R_t=\frac{n_0}{n_t}R_0+\frac{n_t-1}{n_t}(1-\alpha_t)^2R_{t-1}$ and identifies regimes where consistency or improvement is possible, highlighting the crucial role of fresh real data and sample-size scaling. Through theoretical results and extensive simulations (multinomial models and GPT-2–scale experiments), it shows that without ongoing real data infusion, improvement is not guaranteed, but with sufficiently large real-data fraction or appropriately growing sample sizes, the estimator can converge to the true distribution. The work also examines data-aggregation vs. real-data filtration, and discusses adaptive schemes and future directions like RLHF and co-evolution of data distributions, offering practical predictions for maintaining long-term model quality in self-referential training loops.

Abstract

The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as there is some amount of fresh information from the true target distribution. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, capturing how the interplay between the mixture weights and the sample size controls the overall long-term performance. With non-trivial mixture weight of the true distribution, even if it decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and even recover the true target distribution under certain conditions. Simulation studies support our findings and also show that such behavior is more general for other classes of models.
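
The iterative loop described in the abstract can be imitated with a small Monte Carlo experiment. The sketch below is illustrative only and is not the authors' code. It assumes a multinomial model fitted by empirical frequencies, no data accumulation, and that each generation draws $n_t$ samples from $\alpha_t P^* + (1-\alpha_t)\hat{P}_{t-1}$. The function name `simulate_risk` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_risk(theta_star, n, alpha, n_rep=2000, seed=0):
    """Monte Carlo estimate of E||theta_hat_t - theta*||^2 for t = 0..T.

    n[t]     : sample size at generation t (n[0] is the initial real sample)
    alpha[t] : weight of the true distribution at generation t (alpha[0] unused)
    Assumption: the model is refit from scratch on generation-t data only.
    """
    rng = np.random.default_rng(seed)
    theta_star = np.asarray(theta_star, dtype=float)
    T = len(n) - 1
    sq_err = np.zeros(T + 1)
    for _ in range(n_rep):
        # Generation 0: purely real data, estimator = empirical frequencies.
        theta_hat = rng.multinomial(n[0], theta_star) / n[0]
        sq_err[0] += np.sum((theta_hat - theta_star) ** 2)
        for t in range(1, T + 1):
            # Contaminated source: mixture of the truth and the previous model.
            mix = alpha[t] * theta_star + (1 - alpha[t]) * theta_hat
            theta_hat = rng.multinomial(n[t], mix) / n[t]
            sq_err[t] += np.sum((theta_hat - theta_star) ** 2)
    return sq_err / n_rep

if __name__ == "__main__":
    theta = np.full(4, 0.25)          # hypothetical 4-token vocabulary
    n = [200] * 11                    # constant sample size, 10 generations
    mixed = simulate_risk(theta, n, [0.5] * 11)
    synthetic = simulate_risk(theta, n, [0.0] * 11)
    print("alpha = 0.5:", mixed[-1])
    print("alpha = 0.0:", synthetic[-1])
```

Under these assumptions, the run with $\alpha_t = 0.5$ should settle near a small stationary error, while the purely synthetic run ($\alpha_t = 0$) keeps drifting away from $\theta^*$.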

Paper Structure

This paper contains 26 sections, 11 theorems, 95 equations, 6 figures.

Key Result

Theorem 3.1

In the above setting, the sequence $\left( R_t \right)_{t\geq 1}$ (as defined above) satisfies the recurrence $R_t=\frac{n_0}{n_t}R_0+\frac{n_t-1}{n_t}(1-\alpha_t)^2R_{t-1}$, where $R_0=\frac{\sum_k \theta^*(k)(1-\theta^*(k))}{n_{0}}=\frac{1-\Vert\theta^*\Vert_2^2}{n_{0}}.$
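
The recurrence quoted in the TL;DR, $R_t=\frac{n_0}{n_t}R_0+\frac{n_t-1}{n_t}(1-\alpha_t)^2R_{t-1}$, is easy to iterate numerically, which helps in seeing how the schedules $(\alpha_t)$ and $(n_t)$ interact. The following is a minimal sketch, not the authors' implementation. The function name `risk_sequence` and the schedules used in the example are assumptions chosen for illustration.

```python
import numpy as np

def risk_sequence(theta_star, n, alpha):
    """Iterate R_t = (n_0/n_t) R_0 + ((n_t - 1)/n_t) (1 - alpha_t)^2 R_{t-1}.

    n[t]     : sample size at generation t, for t = 0..T
    alpha[t] : weight of the true distribution at generation t (alpha[0] unused)
    """
    theta_star = np.asarray(theta_star, dtype=float)
    r0 = (1.0 - np.sum(theta_star ** 2)) / n[0]   # R_0 = (1 - ||theta*||^2) / n_0
    risks = [r0]
    for t in range(1, len(n)):
        carry = (n[t] - 1) / n[t] * (1 - alpha[t]) ** 2
        risks.append(n[0] / n[t] * r0 + carry * risks[-1])
    return np.array(risks)

if __name__ == "__main__":
    theta = np.full(4, 0.25)                      # hypothetical uniform target
    T = 50
    const_n = [100] * (T + 1)
    growing_n = [100 * (t + 1) for t in range(T + 1)]
    print("alpha=0.5, constant n:", risk_sequence(theta, const_n, [0.5] * (T + 1))[-1])
    print("alpha=0.5, growing n: ", risk_sequence(theta, growing_n, [0.5] * (T + 1))[-1])
    print("alpha=0.0, constant n:", risk_sequence(theta, const_n, [0.0] * (T + 1))[-1])
```

With constant $\alpha_t=\alpha$ and $n_t=n_0=n$, the iteration is a linear map with fixed point $R_0 / \left(1-\frac{n-1}{n}(1-\alpha)^2\right)$, so the error stabilizes at a constant multiple of $R_0$. With $n_t$ growing, the first term vanishes and $R_t$ can shrink toward zero.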

Figures (6)

  • Figure 1: $P^*$ is the true underlying data distribution. At time $t$, model $M_t$ is trained on data $D_t$, where $D_0$ is purely a sample from $P^*$, but subsequently $D_t$ consists of (i) a synthetic part, (ii) an accumulated part, and (iii) fresh information. Can $P_t$ approximate $P^*$ well? In the parametric setting, $P_t=P_{\hat{\theta}_t}$.
  • Figure 2: Iterative evolution of estimation quality for the multinomial model when data arise from a mixture of a fixed ground-truth distribution and synthetic data generated by the model trained at the previous iteration, under various settings for $\alpha_t$ and $n_t$. The top row shows the results without accumulation; the bottom row shows the results with accumulation. The black dashed horizontal line is the value of $R_0$.
  • Figure 3: Iterative evolution of estimation quality for the multinomial model in the purely synthetic regime, when models are trained on accumulated data.
  • Figure 4: Iterative evolution of a GPT-2-like language model, evaluated by perplexity on a held-out data corpus in 4 different settings. The two plots show the same 4 settings (blue: setting 0, orange: setting 1, green: setting 2, red: setting 3) with different y-axis scales.
  • Figure 5: Visual depiction of Theorem \ref{thm:one_step2}: fresh information can improve estimation when using the MLE under a model class $\mathcal{P}$, where the MLE can be seen as estimating the KL-projection of $Q_1$ onto the model space. $P_{\theta^*}$ is the projection of the true data distribution $P^*$.
  • ...and 1 more figure

Theorems & Definitions (25)

  • Theorem 3.1
  • Corollary 3.2
  • Remark 3.3
  • Proposition 3.4
  • Remark 3.5
  • Corollary 3.6
  • Proposition 3.7
  • Proposition 3.8
  • Theorem 4.1
  • Theorem 4.2
  • ...and 15 more