From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources
Soham Bakshi, Sunrit Chakraborty
TL;DR
The paper analyzes the long-term dynamics of iterative training when the data at each round come from a mixture of the true distribution and synthetic samples, formalized at the population level as $P_{t+1}=\alpha_{t+1}P^*+(1-\alpha_{t+1})\hat{P}_t$. For a simplified multinomial (next-token) setting, it derives the exact error evolution, $R_t=\frac{n_0}{n_t}R_0+\frac{n_t-1}{n_t}(1-\alpha_t)^2R_{t-1}$, and identifies regimes in which consistency or improvement is possible, highlighting the crucial role of fresh real data and sample-size scaling. Theoretical results and simulations (multinomial models and GPT-2–scale experiments) show that improvement is not guaranteed without ongoing infusion of real data, whereas a sufficiently large real-data fraction or appropriately growing sample sizes allow the estimator to converge to the true distribution. The work also compares data aggregation with real-data filtration and discusses adaptive schemes and future directions such as RLHF and co-evolution of data distributions, offering practical predictions for maintaining long-term model quality in self-referential training loops.
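To make the error recursion concrete, the following minimal Python sketch simply iterates $R_t=\frac{n_0}{n_t}R_0+\frac{n_t-1}{n_t}(1-\alpha_t)^2R_{t-1}$ for user-supplied schedules. It assumes that $n_t$ is the sample size and $\alpha_t$ the real-data mixture weight at round $t$, and that $R_0$ is the error of the initial estimator; the function name `risk_trajectory` and the example schedules are illustrative choices, not taken from the paper.

```python
def risk_trajectory(r0, n, alpha):
    """Iterate R_t = (n_0/n_t) R_0 + ((n_t - 1)/n_t) (1 - alpha_t)^2 R_{t-1}.

    r0    : initial error R_0 (assumed interpretation)
    n     : sample sizes n_0, ..., n_T (assumed interpretation of n_t)
    alpha : real-data weights alpha_0, ..., alpha_T (alpha[0] is unused)
    Returns the list [R_0, R_1, ..., R_T].
    """
    if len(n) != len(alpha):
        raise ValueError("n and alpha must have the same length")
    risks = [r0]
    for t in range(1, len(n)):
        fresh = n[0] / n[t] * r0                       # first term of the recursion
        carried = (n[t] - 1) / n[t] * (1 - alpha[t]) ** 2 * risks[-1]  # second term
        risks.append(fresh + carried)
    return risks


if __name__ == "__main__":
    rounds = 11
    # Hypothetical schedule 1: constant sample size, no real data after round 0.
    constant = risk_trajectory(1.0, [100] * rounds, [0.0] * rounds)
    # Hypothetical schedule 2: doubling sample size, half of each round is real data.
    growing = risk_trajectory(1.0, [100 * 2 ** t for t in range(rounds)], [0.5] * rounds)
    print("constant n, alpha = 0  :", [round(r, 4) for r in constant])
    print("doubling n, alpha = 0.5:", [round(r, 4) for r in growing])
```

Under this recursion, a constant $n_t$ keeps the first term fixed at $R_0$, so the iterate can never fall below $R_0$, while a growing $n_t$ combined with $\alpha_t>0$ shrinks both terms. This is only an arithmetic illustration of the stated formula, not a reproduction of the paper's experiments.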
Abstract
The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as there is some amount of fresh information from the true target distribution. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, capturing how the interplay between the mixture weights and the sample size controls the overall long-term performance. With a non-trivial mixture weight on the true distribution, even one that decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and even recover the true target distribution under certain conditions. Simulation studies support our findings and also show that such behavior extends to other classes of models.
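The iterative scheme described in the abstract can be sketched as a small simulation. This is an illustrative toy under stated assumptions, not the authors' experimental code: the model is a single multinomial distribution, each round draws a sample from the mixture $\alpha_t P^* + (1-\alpha_t)\hat{P}_{t-1}$, and the contamination-agnostic estimator is taken to be the empirical frequency vector of that sample. The names `simulate` and `squared_error`, the squared-$\ell_2$ error metric, and the example schedules are all hypothetical.

```python
import random


def squared_error(p, q):
    """Squared L2 distance between two probability vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def simulate(p_star, n_sizes, alphas, seed=0):
    """Toy iterative training on contaminated multinomial data.

    p_star  : true distribution P* over k categories
    n_sizes : sample sizes n_0, ..., n_T
    alphas  : real-data weights alpha_0, ..., alpha_T (alphas[0] is unused,
              since round 0 is assumed to use real data only)
    Returns the per-round squared errors of the fitted distribution.
    """
    rng = random.Random(seed)
    k = len(p_star)
    categories = list(range(k))

    def fit(weights, n):
        # Draw n samples and return the empirical frequencies; the fit does
        # not know which samples are real and which are synthetic.
        counts = [0] * k
        for x in rng.choices(categories, weights=weights, k=n):
            counts[x] += 1
        return [c / n for c in counts]

    p_hat = fit(p_star, n_sizes[0])
    errors = [squared_error(p_hat, p_star)]
    for n, a in zip(n_sizes[1:], alphas[1:]):
        # Data source for this round: mixture of P* and the previous model.
        mixture = [a * ps + (1 - a) * ph for ps, ph in zip(p_star, p_hat)]
        p_hat = fit(mixture, n)
        errors.append(squared_error(p_hat, p_star))
    return errors


if __name__ == "__main__":
    p_star = [0.5, 0.3, 0.2]
    rounds = 8
    synthetic_only = simulate(p_star, [200] * rounds, [0.0] * rounds)
    with_fresh = simulate(p_star, [200 * 2 ** t for t in range(rounds)], [0.3] * rounds)
    print("alpha = 0, constant n  :", synthetic_only)
    print("alpha = 0.3, doubling n:", with_fresh)
```

Running the script prints the error path for a purely synthetic loop and for a loop that keeps a fixed share of fresh real data with growing sample sizes. Individual runs are noisy, so any comparison should be averaged over many seeds before drawing conclusions.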
