Quantitative Bounds for Length Generalization in Transformers

Zachary Izzo; Eshaan Nichani; Jason D. Lee

TL;DR

This work provides quantitative, non-asymptotic bounds on length generalization (LG) in transformers. Working in the limit transformer (LT) framework, it proves that LG arises when a model's behavior on long inputs can be simulated by shorter training strings. It analyzes both finite- and infinite-precision regimes and derives explicit bounds on the sufficient training length $N$: for a one-layer finite-precision LT, $N = O\left( \max\{ 2^{p/\gamma}, (L^2 \Delta^7 |\Sigma|^6 \tau^2)/\varepsilon^2 \} \right)$; for a two-layer infinite-precision LT, $N \lesssim (\max(C(f), C(g)) \varepsilon^{-1})^{\max(\gamma(f)^{-1}, \gamma(g)^{-1}, 3)}$; and, in the average-case setting, a Dirichlet-averaged bound $N_0$. The authors introduce a simulation argument that constructs shorter strings preserving the sufficient statistics needed to approximate the computation on longer inputs. They then validate the theory on synthetic LG tasks, where the observed behavior aligns with the predicted scaling and with hard attention. The results deepen the theoretical understanding of extrapolation in transformers and can guide data allocation for context-length generalization in practice. The work also outlines avenues for extending the analysis to deeper architectures and broader input distributions.
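To get a feel for how the one-layer finite-precision bound scales, the expression inside the $O(\cdot)$ can be evaluated numerically. The following is only an illustrative sketch: the function name and argument names are hypothetical, the symbols $p$, $\gamma$, $L$, $\Delta$, $|\Sigma|$, $\tau$ and $\varepsilon$ are treated as opaque positive numbers (their definitions are given in the paper, not here), and the constant hidden by the $O(\cdot)$ is ignored, so only ratios between values are meaningful.

```python
def one_layer_bound_expression(p, gamma, L, Delta, sigma_size, tau, eps):
    """Evaluate max{2^(p/gamma), L^2 * Delta^7 * |Sigma|^6 * tau^2 / eps^2}.

    This is the quantity inside the O(.) of the one-layer finite-precision
    bound quoted in the TL;DR. The hidden constant is NOT included, so the
    return value is not itself a training length.
    """
    if gamma <= 0 or eps <= 0:
        raise ValueError("gamma and eps must be positive")
    # Term that depends on the precision p and on gamma.
    precision_term = 2.0 ** (p / gamma)
    # Term that depends on the target error eps and the remaining parameters.
    error_term = (L ** 2 * Delta ** 7 * sigma_size ** 6 * tau ** 2) / eps ** 2
    return max(precision_term, error_term)


# Hypothetical parameter values, chosen only to illustrate the scaling.
base = one_layer_bound_expression(p=4, gamma=1.0, L=1, Delta=1,
                                  sigma_size=2, tau=1.0, eps=0.5)
halved = one_layer_bound_expression(p=4, gamma=1.0, L=1, Delta=1,
                                    sigma_size=2, tau=1.0, eps=0.25)
print(base, halved, halved / base)
```

With these values the error term dominates, so halving $\varepsilon$ multiplies the expression by four, reflecting the $\varepsilon^{-2}$ dependence. When the precision term $2^{p/\gamma}$ dominates instead, changing $\varepsilon$ has no effect on the expression.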

Abstract

We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large that threshold must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.
