Language Generation with Replay: A Learning-Theoretic View of Model Collapse

Giorgio Racca, Michal Valko, Amartya Sanyal

Abstract

As scaling laws push the training of frontier large language models (LLMs) toward ever-growing data requirements, training pipelines are approaching a regime where much of the publicly available online text may be consumed. At the same time, widespread LLM usage increases the volume of machine-generated content on the web; together, these trends raise the likelihood of generated text re-entering future training corpora, increasing the associated risk of performance degradation often called model collapse. In practice, model developers address this concern through data cleaning, watermarking, synthetic-data policies, or, in some cases, blissful ignorance. However, the problem of model collapse in generative models has not been examined from a learning-theoretic perspective: we study it through the theoretical lens of the language generation in the limit framework, introducing a replay adversary that augments the example stream with the generator's own past outputs. Our main contribution is a fine-grained learning-theoretic characterization of when replay fundamentally limits generation: while replay is benign for the strongest notion of uniform generation, it provably creates separations for the weaker notions of non-uniform generation and generation in the limit. Interestingly, our positive results mirror heuristics widely used in practice, such as data cleaning, watermarking, and output filtering, while our separations show when these ideas can fail.
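The replay interaction described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's formal definition of a sequence with replay: the adversary schedule (`replay_every`), the toy class of "multiples of k" languages, and the names `run_with_replay` and `gcd_generator` are all hypothetical choices made here for illustration. It only shows the shape of the protocol, in which each round's example is either a genuine element of the target language or one of the generator's own earlier outputs.

```python
from math import gcd
from typing import Callable, List, Sequence, Tuple

Generator = Callable[[List[int]], int]


def gcd_generator(examples: List[int]) -> int:
    """Toy generator (assumption): guess the language 'multiples of g',
    where g is the gcd of everything seen, and output the smallest
    multiple of g that has not yet appeared in the stream."""
    g = 0
    for x in examples:
        g = gcd(g, x)
    seen = set(examples)
    candidate = g
    while candidate in seen:
        candidate += g
    return candidate


def run_with_replay(
    true_examples: Sequence[int],
    generator: Generator,
    rounds: int,
    replay_every: int = 2,
) -> Tuple[List[int], List[int]]:
    """Toy replay adversary (assumption): on every `replay_every`-th round
    it shows the generator's most recent output instead of a fresh example."""
    stream: List[int] = []
    outputs: List[int] = []
    fresh = iter(true_examples)
    for t in range(1, rounds + 1):
        if outputs and t % replay_every == 0:
            x = outputs[-1]   # replayed machine-generated element
        else:
            x = next(fresh)   # genuine element of the target language
        stream.append(x)
        outputs.append(generator(stream))
    return stream, outputs


# Target language: multiples of 6; three genuine examples, six rounds.
stream, outputs = run_with_replay([12, 18, 30], gcd_generator, rounds=6)
print(stream)   # examples shown to the generator, replays included
print(outputs)  # what the generator produced in each round
```

In this run the generator cannot tell from the stream alone which examples were genuine and which were its own outputs fed back to it. That is the difficulty the replay setting is meant to capture. The fact that this particular toy generator is unharmed by replay says nothing about the paper's general results.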

Paper Structure

This paper contains 25 sections, 19 theorems, 42 equations, 1 figure, 3 tables, 3 algorithms.

Key Result

Theorem 3.1

A binary hypothesis class ${\mathcal{H}} \subseteq \left\{{0,1}\right\}^{\mathcal{X}}$ satisfying the UUS property is uniformly generatable with replay if and only if it is uniformly generatable. In particular, any generator ${\mathcal{G}}$ that generates ${\mathcal{H}}$ uniformly can be converted i

Figures (1)

  • Figure 1: Online construction of a hard hypothesis class for a given proper generator. The horizontal axis represents the hypotheses in ${\mathcal{H}}$, and the vertical axis represents the instances from the domain ${\mathcal{X}}$. For every coordinate pair $(i,j)$, a filled circle ($\bullet$) indicates $j\in\mathop{\mathrm{supp}}\left({h_i}\right)$, while an empty circle ($\circ$) indicates $j\notin\mathop{\mathrm{supp}}\left({h_i}\right)$. A box around a label on the vertical axis means that the instance has been added to the enumeration queue $Q$, while a shaded box means that the instance has been shown as an example $x_t$. Finally, the L-shaped dashed line marks the current boundaries of ${\mathcal{G}}$'s knowledge, as tracked by $I$ and $J$. We illustrate the first steps of the interaction. At initialization, the adversary inserts instance $1$ into the enumeration queue $Q$ and installs the trap hypothesis-instance pair $\left({i',j'}\right) = \left({2,2}\right)$. The counters $I$ and $J$ are both set to $2$. At step 1, the adversary reveals $x_1=1$. For illustrative purposes, we assume that at step 1 the generator ${\mathcal{G}}$ outputs $\hat{h}_1 = h_2$. This triggers the diagonalization mode of Algorithm \ref{alg:lower-bound}: instance $d_1 = 3$ is assigned exclusively to the output hypothesis $h_2$; the current trap instance $j'=2$ is added to $Q$; a new trap hypothesis-instance pair $\left({i',j'}\right) = \left({3,4}\right)$ is created beyond $I$ and $J$ by assigning instance $e_1=4$ to all hypotheses except for $h_3$; finally, instance $c_1=5$ is assigned to all hypotheses and is therefore added to the enumeration queue. When the round ends, the counters $I$ and $J$ are set to $3$ and $5$, respectively. Then step 2 begins with $x_2=\min Q = 2$ being revealed to ${\mathcal{G}}$. We assume that ${\mathcal{G}}$ queries $F(4,6)$: instance $6$ is therefore assigned to all hypotheses and added to $Q$. Furthermore, the counters $I$ and $J$ move to $4$ and $6$, respectively. Suppose ${\mathcal{G}}$ outputs $\hat{h}_2 = h_1$. This time the overgeneralization mode of Algorithm \ref{alg:lower-bound} is triggered. In this case, the trap hypothesis-instance pair remains the same. At the end of the round, $c_2=7$ is added to $Q$ and the counter $J$ is updated to $7$.

Theorems & Definitions (47)

  • Definition 2.1: Sequence with replay for a hypothesis and a generator
  • Definition 2.2: Uniform generatability with replay
  • Definition 2.3: Non-uniform generatability with replay
  • Definition 2.4: Generatability in the limit with replay
  • Definition 2.5: Proper generatability in the limit with replay
  • Theorem 3.1: Equivalence of uniform generation with and without replay
  • proof
  • Theorem 4.1: Hardness of non-uniform generation with replay
  • proof
  • Theorem 5.1
  • ...and 37 more