Provable Long-Range Benefits of Next-Token Prediction

Xinyuan Cao; Santosh S. Vempala

TL;DR

The paper establishes a formal, complexity-theoretic link between training a language model by next-token prediction and the model's long-range coherence. It introduces next-$k$-token distinguishers and a matching notion of indistinguishability, and proves that a model trained to minimize next-token loss becomes indistinguishable from the training distribution on every bounded window of $k$ tokens, regardless of document length. The argument combines a distinguisher-based boosting framework with efficient RNN implementations, including synchronized enumeration, to yield polynomial-size models in which a decrease in next-token loss translates into a smaller KL divergence from the true distribution. The results also extend to bounded-bit-size computations, showing that practical space limits do not destroy the indistinguishability guarantees. Together, these results give a rigorous explanation for why next-token prediction captures long-range structure and coherence in autoregressive language models, with concrete complexity bounds and guidance for scaling and implementation.
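The link between next-token loss and KL divergence mentioned above rests on a standard identity, written out here as a sketch rather than as the paper's own statement. The symbols $p$ (true document distribution), $q$ (model), $n$ (document length), and $H(p)$ (entropy of $p$) are notation assumed for this illustration:

$$
\mathbb{E}_{x \sim p}\left[-\sum_{t=1}^{n} \log q(x_t \mid x_{<t})\right]
= \mathbb{E}_{x \sim p}\left[-\log q(x)\right]
= H(p) + \mathrm{KL}(p \,\|\, q).
$$

The first equality is the chain rule for the autoregressive model $q$. Since $H(p)$ does not depend on the model, any decrease in expected next-token loss is an equal decrease in $\mathrm{KL}(p \,\|\, q)$.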

Abstract

Why do modern language models, trained to do well on next-word prediction, appear to generate coherent documents and capture long-range structure? Here we show that next-token prediction is provably powerful for learning longer-range structure, even with common neural network architectures. Specifically, we prove that optimizing next-token prediction over a Recurrent Neural Network (RNN) yields a model that closely approximates the training distribution: for held-out documents sampled from the training distribution, no algorithm of bounded description length limited to examining the next $k$ tokens, for any $k$, can distinguish between $k$ consecutive tokens of such documents and $k$ tokens generated by the learned language model following the same prefix. We provide polynomial bounds (in $k$, independent of the document length) on the model size needed to achieve such $k$-token indistinguishability, offering a complexity-theoretic explanation for the long-range coherence observed in practice.
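To make the notion of a next-$k$-token distinguisher concrete, here is a minimal toy sketch. It is not the paper's construction: the two-token vocabulary, the first-order Markov "true" and "model" processes, the `no_switch` distinguisher, and the definition of advantage as the absolute gap in expected output are all assumptions chosen for illustration.

```python
from itertools import product


def continuation_prob(last_token, continuation, stay_prob):
    """Probability of a binary continuation under a first-order Markov chain
    that repeats the previous token with probability stay_prob."""
    prob = 1.0
    prev = last_token
    for tok in continuation:
        prob *= stay_prob if tok == prev else 1.0 - stay_prob
        prev = tok
    return prob


def advantage(distinguisher, last_token, k, true_stay, model_stay):
    """Exact |E_true[d] - E_model[d]|, enumerating all 2**k continuations
    of length k that follow the same prefix (summarized by its last token)."""
    gap = 0.0
    for cont in product((0, 1), repeat=k):
        p_true = continuation_prob(last_token, cont, true_stay)
        p_model = continuation_prob(last_token, cont, model_stay)
        gap += distinguisher(last_token, cont) * (p_true - p_model)
    return abs(gap)


def no_switch(last_token, continuation):
    """Hypothetical distinguisher: outputs 1 when the k-token window
    never leaves the last prefix token, and 0 otherwise."""
    return 1.0 if all(tok == last_token for tok in continuation) else 0.0


if __name__ == "__main__":
    # A poorly fit model (stay probability 0.5) against a "true" process
    # with stay probability 0.9, for several window lengths k.
    for k in (1, 2, 4):
        print(k, advantage(no_switch, 0, k, true_stay=0.9, model_stay=0.5))
```

In this toy setting the advantage of `no_switch` is the gap between the two processes' probabilities of producing a window with no switches, and it is zero when the model matches the true process. Indistinguishability in the abstract's sense asks that every distinguisher of bounded description length have small advantage, for any window length $k$.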
