Theoretical Proof that Auto-regressive Language Models Collapse when Real-world Data is a Finite Set

Lecheng Wang, Xianjie Shi, Ge Li, Jia Li, Xuanming Zhang, Yihong Dong, Wenpin Jiao, Hong Mei

TL;DR

This work addresses the risk that auto-regressive LMs collapse when trained on recursively generated data drawn from a finite real-world corpus. It gives a formal proof, under two data paradigms (Replace and Accumulate-Subsample), that as the generation index $n$ grows, the LM output distribution $\hat{p}_n(v_i|\boldsymbol{x})$ converges to a function built from accumulated per-generation errors rather than to the original data distribution; collapse therefore occurs regardless of the rate at which synthetic data is added. The contributions are precise definitions, a Main Theorem with closed-form expressions for $\hat{p}_n(v_i|\boldsymbol{x})$ under both paradigms, and empirical evidence from GPT-Neo models trained on TinyStories, whose perplexity on the TinyStories validation set grows over generations. The findings imply that avoiding collapse hinges on improving the quality and filtering of synthetic data rather than merely limiting its quantity, with practical implications for data curation and sustainable LM training.

Abstract

Auto-regressive language models (LMs) have been widely used to generate data in data-scarce domains to train new LMs, compensating for the scarcity of real-world data. Previous work experimentally found that LMs collapse when trained on recursively generated data. This paper presents a theoretical proof: once a corpus (such as a subset of the World Wide Web) begins to incorporate generated data and no new real-world data is added to the corpus, then no matter how small the amount of data each LM generates and contributes to the corpus, LM collapse is inevitable after sufficient time. This finding suggests that attempts to mitigate collapse by limiting the quantity of synthetic data in the corpus are fundamentally insufficient. Instead, avoiding collapse hinges on ensuring the quality of synthetic data.
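
The claim that even a small synthetic contribution per generation is enough can be made concrete with a toy model. The sketch below is not the paper's construction or its Accumulate-Subsample paradigm: it assumes a fixed-size corpus of single tokens, a hypothetical "model" that reproduces the corpus's empirical frequencies exactly, and a hypothetical function name `recycle_corpus`. Each generation writes `tokens_per_gen` generated tokens over randomly chosen corpus entries, and no real data is ever added. Only sampling noise is modelled, so this illustrates the direction of the result rather than the proof.

```python
import random
from collections import Counter


def recycle_corpus(corpus, tokens_per_gen=1, max_generations=200_000, seed=0):
    """Toy fixed-size corpus that never receives new real data (an assumption).

    Each generation, a stand-in "model" samples `tokens_per_gen` tokens from
    the corpus's current empirical distribution, and each sampled token
    overwrites a uniformly chosen corpus entry.
    Returns (generations_run, final_token_counts).
    """
    rng = random.Random(seed)
    corpus = list(corpus)
    for generation in range(1, max_generations + 1):
        # Drawing a random entry is the same as sampling the empirical distribution.
        generated = [rng.choice(corpus) for _ in range(tokens_per_gen)]
        for token in generated:
            corpus[rng.randrange(len(corpus))] = token
        if len(set(corpus)) == 1:
            # Only one token type is left: the original distribution is gone.
            return generation, Counter(corpus)
    return max_generations, Counter(corpus)


if __name__ == "__main__":
    start = ["a"] * 15 + ["b"] * 10 + ["c"] * 5
    generations, counts = recycle_corpus(start)
    print("generations until one token type remains:", generations)
    print("final corpus composition:", dict(counts))
```

Lowering `tokens_per_gen` slows the drift but does not stop it in this toy setting, which mirrors the abstract's point that limiting the quantity of synthetic data only delays the outcome.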

Paper Structure

This paper contains 36 sections, 4 theorems, 51 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 2.1

For any specific token sequence $\boldsymbol x$ and token $v_i$, along with the corresponding $\alpha[j]$ and $\alpha_i[j]$, the theorem gives one closed-form expression for $\hat{p}_n(v_i|\boldsymbol{x})$ under the Replace paradigm and another under the Accumulate-Subsample paradigm, where $k$, as defined in the paper's section on subsampling, is the rate at which generated data is added to the corpus.

Figures (3)

  • Figure 1: Composition of the training set for each generation of language models (LMs) under different data paradigms. On the left is Replace, and on the right is a special case of Accumulate-Subsample when $k = 1$. From top to bottom, each rectangle represents the composition of the training set of the 1st, 2nd, 3rd, and 4th generation LMs, respectively. $\mathcal{D}_0$, $\mathcal{S}_1$, $\mathcal{S}_2$, and $\mathcal{S}_3$ represent the initial corpus, the data generated by the 1st-generation LM, by the 2nd-generation LM, and by the 3rd-generation LM, respectively. The original figure is Figure 2 in the paper by Suffer.
  • Figure 2: Language model (LM) collapse is a degenerative process whereby, over generations, LMs forget the underlying text distribution of the initial training corpus. Our experiment begins with an initial corpus used to train the LM at generation 1. Then, generation 2 is trained using the text generated by generation 1, generation 3 using the text generated by generation 2, and so on. The figure depicts the process of LM collapse. The dotted lines True $p(\text{was}|\text{there})$ and True $p(\text{were}|\text{there})$ refer to the probabilities of the two phrases 'there was' and 'there were' given 'there' in the initial corpus TinyStories, which are 0.76 and 0.05 respectively. The first-generation LM with 33 million parameters trained on this corpus learns these probabilities well (the solid lines $p(\text{was}|\text{there})$ and $p(\text{were}|\text{there})$ represent the probabilities that the LM predicts the next token after 'there' to be 'was' and 'were' respectively). However, as the generation index $n$ increases, the solid lines gradually deviate from the dotted lines, underscoring the growing disconnect between the output of LMs and the initial training text.
  • Figure 3: Perplexity of LMs with 1M and 33M parameters over 40 generations, evaluated on the validation set of the TinyStories dataset. Model perplexity is not only an indicator of the performance of LMs, but also a measure of the 'distance' between the distribution learned by the LM and the distribution of the initial corpus. (The formula for calculating perplexity can be found in the paper's section on perplexity.) The increase in perplexity is due to the increase in this distance, which means that the learned distribution deviates increasingly from the initial corpus. This corroborates our theoretical result.
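
The recursive setup in Figure 2 and the perplexity reading in Figure 3 can be mimicked on a single context. The following is a minimal sketch under stated assumptions, not the paper's experiment: the values 0.76 and 0.05 are taken from the Figure 2 caption, while the catch-all `<other>` token, the smoothed frequency estimator standing in for LM training, the sample size, and the function names are all hypothetical. Perplexity is computed with the standard definition, the exponential of the cross-entropy between the true and the learned distribution, which may differ in detail from the formula used in the paper.

```python
import math
import random

# Next-token distribution after the context "there".
# 0.76 and 0.05 come from the Figure 2 caption; "<other>" is an assumption
# that lumps all remaining tokens together.
TRUE_P = {"was": 0.76, "were": 0.05, "<other>": 0.19}


def fit(samples, vocab, smoothing=1e-3):
    """Smoothed relative frequencies, a stand-in for training an LM."""
    total = len(samples) + smoothing * len(vocab)
    return {v: (samples.count(v) + smoothing) / total for v in vocab}


def perplexity(true_p, model_p):
    """exp(cross-entropy) of model_p measured against true_p."""
    return math.exp(-sum(p * math.log(model_p[v]) for v, p in true_p.items()))


def simulate_replace(generations=40, samples_per_gen=200, seed=0):
    """Replace paradigm: generation n is fit only on generation n-1's output.

    Generation 1 is fit on a finite sample of the real distribution.
    Returns a list of (generation, p_was, p_were, perplexity) tuples.
    """
    rng = random.Random(seed)
    vocab = list(TRUE_P)
    current = dict(TRUE_P)
    history = []
    for n in range(1, generations + 1):
        weights = [current[v] for v in vocab]
        data = rng.choices(vocab, weights=weights, k=samples_per_gen)
        current = fit(data, vocab)
        history.append(
            (n, current["was"], current["were"], perplexity(TRUE_P, current))
        )
    return history


if __name__ == "__main__":
    for n, p_was, p_were, ppl in simulate_replace()[::10]:
        print(f"gen {n:2d}  p(was|there)={p_was:.3f}  "
              f"p(were|there)={p_were:.3f}  perplexity={ppl:.3f}")
```

In this sketch the perplexity can never fall below `perplexity(TRUE_P, TRUE_P)`, so any drift of the learned probabilities away from the dotted-line values shows up as a perplexity increase, which is the sense in which Figure 3 treats perplexity as a distance.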

Theorems & Definitions (9)

  • Theorem 2.1: LM collapse
  • Remark 2.2
  • Proposition 2.3
  • proof
  • Proposition 2.4
  • proof
  • Proposition 2.5
  • Remark 2.6
  • Remark B.1