Theoretical Proof that Auto-regressive Language Models Collapse when Real-world Data is a Finite Set
Lecheng Wang, Xianjie Shi, Ge Li, Jia Li, Xuanming Zhang, Yihong Dong, Wenpin Jiao, Hong Mei
TL;DR
This work addresses the risk that auto-regressive LMs collapse when trained on recursively generated data while the real-world corpus remains a finite set. Under two data paradigms, Replace and Accumulate-Subsample, it formally proves that as $n$ grows the LM output distribution $\hat{p}_n(v_i|\boldsymbol{x})$ converges to a function of the accumulated per-generation errors rather than to the original data distribution, so collapse occurs however little synthetic data each generation contributes. The contributions include precise definitions, a Main Theorem with closed-form expressions for $\hat{p}_n(v_i|\boldsymbol{x})$ under both paradigms, and empirical evidence from GPT-Neo models trained on TinyStories showing growing divergence from the initial corpus, measured by perplexity. The findings imply that avoiding collapse hinges on improving the quality and filtering of synthetic data rather than merely limiting its quantity, with practical implications for data curation and sustainable LM training.
Abstract
Auto-regressive language models (LMs) are widely used to generate training data for new LMs in domains where real-world data is scarce. Previous work found experimentally that LMs collapse when trained on recursively generated data. This paper presents a theoretical proof: once a corpus (such as a subset of the World Wide Web) begins to incorporate generated data and no new real-world data is added to it, LM collapse is inevitable after sufficient time, no matter how little data each LM generates and contributes to the corpus. This finding suggests that attempts to mitigate collapse by limiting the quantity of synthetic data in the corpus are fundamentally insufficient. Instead, avoiding collapse hinges on ensuring the quality of synthetic data.
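The recursive-training setting described above can be illustrated with a toy simulation. The following is a minimal sketch, not the paper's construction: it assumes a unigram "model" fitted by relative frequencies instead of an auto-regressive LM, so the only per-generation error is sampling error. The function names (`fit`, `generate`, `simulate`), the parameter defaults, and the reading of the two paradigms (Replace trains each generation only on the previous generation's output; Accumulate-Subsample adds each generation's output to a growing pool and trains on a fixed-size subsample of it) are assumptions made for illustration.

```python
import random
from collections import Counter


def fit(corpus, vocab_size):
    """Maximum-likelihood unigram estimate: relative token frequencies."""
    counts = Counter(corpus)
    total = len(corpus)
    return [counts[t] / total for t in range(vocab_size)]


def generate(probs, n, rng):
    """Draw n tokens from the fitted distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=n)


def simulate(paradigm, generations=200, vocab_size=20, corpus_size=200, seed=0):
    """Return the number of tokens with non-zero probability at each generation.

    Hypothetical toy setup: the "real-world data" is a fixed finite sample
    that is never refreshed with new real data.
    """
    rng = random.Random(seed)
    real = [rng.randrange(vocab_size) for _ in range(corpus_size)]
    pool = list(real)   # corpus that accumulates synthetic data
    train = list(real)  # training set of the current generation
    support_sizes = []
    for _ in range(generations):
        probs = fit(train, vocab_size)
        support_sizes.append(sum(p > 0 for p in probs))
        synthetic = generate(probs, corpus_size, rng)
        if paradigm == "replace":
            # Assumed reading: the next model sees only generated data.
            train = synthetic
        elif paradigm == "accumulate_subsample":
            # Assumed reading: generated data joins the pool and a
            # fixed-size subsample of the pool is used for training.
            pool.extend(synthetic)
            train = rng.sample(pool, corpus_size)
        else:
            raise ValueError(f"unknown paradigm: {paradigm}")
    return support_sizes


if __name__ == "__main__":
    for name in ("replace", "accumulate_subsample"):
        sizes = simulate(name)
        print(name, "support size, first vs. last generation:", sizes[0], sizes[-1])
```

In this toy version, a token that receives zero count under Replace can never be generated again, so the support of the fitted distribution can only shrink. Under Accumulate-Subsample the real data stays in the pool, but its share of the pool falls with every generation. The sketch says nothing about rates or limits; those are the subject of the Main Theorem.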
