Provable Long-Range Benefits of Next-Token Prediction

Xinyuan Cao, Santosh S. Vempala

TL;DR

The paper establishes a formal, complexity-theoretic link between training a language model by next-token prediction and the model’s long-range coherence. It introduces next-$k$-token distinguishers and the corresponding notion of indistinguishability, and proves that a model trained to minimize next-token loss cannot be told apart from the training distribution by any distinguisher of bounded description length that examines a window of $k$ consecutive tokens, regardless of document length. The core mechanism combines a distinguisher-based boosting framework with efficient RNN implementations, including synchronized enumeration, to yield polynomial-size models in which a decrease in next-token loss implies a smaller KL divergence from the true distribution. The results also extend to computations of bounded bit size, showing that practical limits on bit size do not destroy the indistinguishability guarantees. Collectively, the work provides a rigorous explanation for why next-token prediction captures long-range structure and coherence in autoregressive language models, with polynomial bounds (in $k$) on the required model size and guidance for scaling and implementation.
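
The boosting idea can be illustrated on a single window of $k$ tokens. The sketch below is not the paper's construction: the paper applies the reweighting $qe^{-\alpha d}/Z$ over successive length-$k$ blocks of a document and implements it inside an RNN, whereas this example only reweights one toy distribution over binary windows. The distributions `p` and `q`, the distinguisher `d`, the step size `alpha`, and the sign convention for the advantage are all assumptions chosen for illustration.

```python
import itertools
import math

def kl(p, q):
    """KL(p || q) in nats for two distributions given as dicts over the same support."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def boost(q, d, alpha):
    """Return q'(x) = q(x) * exp(-alpha * d(x)) / Z, where Z normalizes q'."""
    weights = {x: q[x] * math.exp(-alpha * d(x)) for x in q}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

k = 3
windows = list(itertools.product((0, 1), repeat=k))

# Hypothetical "true" distribution p: windows whose bits all agree are more likely.
raw = {x: 4.0 if len(set(x)) == 1 else 1.0 for x in windows}
total = sum(raw.values())
p = {x: v / total for x, v in raw.items()}

# Hypothetical initial model q: uniform over all windows.
q = {x: 1.0 / len(windows) for x in windows}

# Hypothetical distinguisher: outputs 1 when the bits do not all agree.
def d(x):
    return 0 if len(set(x)) == 1 else 1

# Advantage (assumed convention): how much more often d fires on model samples.
advantage = sum(q[x] * d(x) for x in windows) - sum(p[x] * d(x) for x in windows)

# Down-weight the windows that d flags, then renormalize.
q_new = boost(q, d, alpha=0.5)

print(f"advantage of d: {advantage:.3f}")
print(f"KL(p||q)  = {kl(p, q):.4f}")
print(f"KL(p||q') = {kl(p, q_new):.4f}")
```

In this toy setting the change in KL divergence is $\alpha\,\mathbb{E}_p[d] + \log Z$, which is negative for small positive $\alpha$ whenever the distinguisher fires more often on the model than on the truth. That is the sense in which a distinguisher with positive advantage can be converted into a better model.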

Abstract

Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.

Paper Structure

This paper contains 39 sections, 29 theorems, 229 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

For any $0<\epsilon<1$ and $k,\tau,\mathcal{d}\in\mathbb{N}$, for alphabet size $|\Sigma|=O(1)$, with probability at least $0.9$, by trying only two model sizes and minimizing next-token loss, we can output an LM $q$ with the following properties:

Figures (6)

  • Figure 1.1: Illustration of the boosting construction for $q'$ in Lemma \ref{lemma:lm_update_decrease_kl}. The axis is the index of the text, ranging from $1$ to $n$. The new model $q'$ behaves identically to the original $q$ until $i_0^*$. After that, it repeatedly applies a reweighting $qe^{-\alpha d}/Z$ over subsequent length-$k$ blocks, starting at $i_0^*+1$.
  • Figure 1.2: Two examples of RNNs, and their corresponding unrolled feedforward networks. In both RNNs, $h$ is the hidden node set. The subscript indicates the input index. RNN $a$ receives a new input $x_i$ at each time step $i$. Thus, the output corresponding to the input $x_{:i+1}$ is computed at time $i+2$. RNN $b$ receives a new input $x_i$ and holds it for three consecutive time steps $(t=3i,3i+1,3i+2)$. This is managed by a control node $v\in\{0,1,2\}$. Thus, the output corresponding to the input $x_{:i+1}$ is computed at time $3i+3$.
  • Figure 2.1: A sketch of the original RNN $Q$ and the constructed RNN $U$. The RNN $Q$ has a hidden node set $H_Q$ and the remaining nodes $R_Q$. The RNN $U$ maintains counter nodes (Claim \ref{claim:lem_proof_counters}), a node set $Y$ that stores the input subsequence $x_{i_0+1:i_1+1}$ (Claim \ref{claim:lem_proof_input_set}), a node set $E$ that enumerates the digits of all length-$k$ strings (Claim \ref{claim:lem_proof_enumerater}), a node set $H$ that stores the hidden node set corresponding to the prefix $x_{:i_0+1}$ (Claim \ref{claim:lem_proof_H}), a node set $\tilde{H}$ that tracks the hidden node set computed by extending the fixed prefix $x_{:i_0+1}$ with the $r_1$-length prefix $z^{(j_1)}_{:r_1+1}$ of a length-$k$ string (Claim \ref{claim:lem_proof_Htilde}), and a node set $R$ that produces the final output (Claim \ref{claim:lem_proof_R}). Claim \ref{claim:general_i0*} introduces another counter node and studies the initial case when $i_1\leq i_0^*$. Note that the counter nodes and the node sets $Y,H,E$ serve as the hidden node set of the constructed RNN $U$.
  • Figure 2.2: The Load-Run-Hold schedule for the node set $H$ within each input loop.
  • Figure 2.3: The Load-Run-Hold schedule for the node set $\tilde{H}$ within each input loop of $(2^k+1)k\tau$ steps. The loop begins with $2^k$ "string loops" of length $k\tau$, each divided into $k$ "digit loops" of length $\tau$. For the first $k-1$ digit loops, the schedule is $\mathcal{T}_Q-1$ steps of RUN, $\tau-\mathcal{T}_Q$ steps of HOLD, and one final step of RUN. The final digit loop (the $k$-th one) within each of these string loops is $\mathcal{T}_Q-1$ steps of RUN, $\tau-\mathcal{T}_Q$ steps of HOLD, and one step of LOAD from the node set $H$. The entire sequence concludes with a final $k\tau$ "string loop" which consists of $k\tau-1$ steps of HOLD, followed by one last step of LOAD from the node set $H$.
  • ...and 1 more figure
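
The step counts in the Figure 2.3 caption can be checked with a short script. The sketch below builds the Load-Run-Hold schedule for $\tilde{H}$ exactly as the caption describes it and confirms that one input loop has $(2^k+1)k\tau$ steps. The function name `htilde_schedule`, the parameter names, and the requirement $1\leq\mathcal{T}_Q\leq\tau$ (needed so that the HOLD count $\tau-\mathcal{T}_Q$ is nonnegative) are assumptions of this example, not taken from the paper.

```python
def htilde_schedule(k, tau, t_q):
    """Per-step schedule for one input loop, as a list of 'RUN', 'HOLD', 'LOAD'."""
    if not (1 <= t_q <= tau):
        raise ValueError("this sketch assumes 1 <= t_q <= tau")
    steps = []
    for _ in range(2 ** k):          # 2^k string loops, each of length k * tau
        for digit in range(k):       # k digit loops, each of length tau
            steps += ["RUN"] * (t_q - 1)
            steps += ["HOLD"] * (tau - t_q)
            # The k-th digit loop ends with a LOAD from H; earlier ones end with RUN.
            steps.append("LOAD" if digit == k - 1 else "RUN")
    # Final string loop: k * tau - 1 steps of HOLD, then one last LOAD from H.
    steps += ["HOLD"] * (k * tau - 1) + ["LOAD"]
    return steps

# Small hypothetical parameters, chosen only to keep the output readable.
k, tau, t_q = 2, 4, 3
schedule = htilde_schedule(k, tau, t_q)

print(len(schedule), (2 ** k + 1) * k * tau)   # the two counts should agree
print(schedule[: k * tau])                     # the first string loop
```

Each digit loop contributes $(\mathcal{T}_Q-1)+(\tau-\mathcal{T}_Q)+1=\tau$ steps, so the $2^k$ string loops take $2^k k\tau$ steps and the closing loop adds another $k\tau$, matching the total stated in the caption.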

Theorems & Definitions (80)

  • Definition 1: Language Model
  • Definition 2: Next-$k$-token Distinguisher
  • Theorem 1: Minimizing Next-token Loss Yields an Indistinguishable LM
  • Lemma 1: Boosted Text Distribution
  • Lemma 2: Boosted Next-token Probability
  • Lemma 3: RNN Boosting
  • Lemma 4: Model Self-boosting and Loss Minimization
  • Lemma 5: Next-token Loss and Maximum Log-Likelihood
  • proof
  • Lemma 6: Next-token Loss and KL Divergence
  • ...and 70 more