Quantitative Bounds for Length Generalization in Transformers

Zachary Izzo, Eshaan Nichani, Jason D. Lee

TL;DR

This work provides quantitative, non-asymptotic bounds on length generalization (LG) for transformers. It works within the limit-transformer (LT) framework and proves that LG arises when a model's behavior on long inputs can be simulated by its behavior on shorter training strings. Both finite- and infinite-precision regimes are analyzed, with explicit bounds: for a one-layer finite-precision LT, $N = O\left( \max\{ 2^{p/\gamma}, (L^2 \Delta^7 |\Sigma|^6 \tau^2)/\varepsilon^2 \} \right)$; for a two-layer infinite-precision LT, $N \lesssim (\max(C(f), C(g)) \varepsilon^{-1})^{\max(\gamma(f)^{-1}, \gamma(g)^{-1}, 3)}$; and an average-case (Dirichlet-average) bound $N_0$. The authors introduce a simulation argument that constructs shorter strings preserving the sufficient statistics needed to approximate the computation on longer inputs. They then validate the theory on synthetic LG tasks, where the observed scaling and hard-attention behavior align with the predictions. The results deepen the theoretical understanding of extrapolation in transformers and can guide data allocation for context-length generalization in practice. Extensions to deeper architectures and broader input distributions are left as future work.

Abstract

We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large it must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.

Paper Structure

This paper contains 32 sections, 18 theorems, 179 equations, 4 figures.

Key Result

Theorem 4.1

There exists an $N = O\left(\max\left\{2^{p/\gamma}, \,\frac{L^2\Delta^7|\Sigma|^6\tau^2}{\varepsilon^2}\right\}\right)$ such that $\|f(x) - g(x)\|\leq \varepsilon$ for all $|x| \leq N$ implies that $\|f(x) - g(x)\| = O(\varepsilon)$ for any sequence $x$.
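To get a feel for which term of the Theorem 4.1 bound dominates, the two expressions inside the maximum can be evaluated numerically. The following is a minimal sketch only: the function name `lg_bound_terms` and its argument names are hypothetical, the constant hidden by the $O(\cdot)$ is taken to be 1 (an assumption, so the output is a scaling guide rather than an actual training length), and `sigma_size` stands for $|\Sigma|$.

```python
def lg_bound_terms(p, gamma, L, Delta, sigma_size, tau, eps):
    """Evaluate the two terms inside the max of the Theorem 4.1 bound.

    Assumption: the constant hidden by O(.) is set to 1, so the values
    only indicate how the bound scales with each parameter.
    """
    # First term: 2^{p / gamma}
    precision_term = 2.0 ** (p / gamma)
    # Second term: L^2 * Delta^7 * |Sigma|^6 * tau^2 / eps^2
    complexity_term = (L ** 2 * Delta ** 7 * sigma_size ** 6 * tau ** 2) / eps ** 2
    return precision_term, complexity_term, max(precision_term, complexity_term)


if __name__ == "__main__":
    terms = lg_bound_terms(p=4, gamma=1, L=1, Delta=1, sigma_size=2, tau=1, eps=0.5)
    print(terms)
```

Because $\Delta$ enters with exponent 7 and $|\Sigma|$ with exponent 6, doubling either one multiplies the second term by 128 or 64 respectively, whereas halving $\varepsilon$ only multiplies it by 4.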

Figures (4)

  • Figure 1: Experiments on $\mathrm{SimpleTask}$, for varying values of $\omega$. Left: For fixed training length, as test length increases, the test loss plateaus at a finite value. Right: The value the test loss plateaus at decreases monotonically with training length, and increases monotonically with $\omega$.
  • Figure 2: Experiments on $\mathrm{ModPTask}$, for varying values of $\Delta = p$. Left: For fixed training length and $\Delta$, as test length increases, the test loss plateaus at a finite value. Right: The value the test loss plateaus at decreases monotonically with training length, and increases monotonically with $\Delta$.
  • Figure 3: For the $\mathrm{ModPTask}$, the softmax attention approximates uniform attention on all positions $\equiv k \pmod{p}$.
  • Figure 4: Experiments on the in-context $k$-gram task, for varying $k$ and vocabulary size $S$. Left: For fixed training length, as test length increases, the test loss plateaus at a finite value. Middle: The value the test loss plateaus at decreases monotonically with training length, and increases with $S$. Right: The value the test loss plateaus at also increases monotonically with $k$.
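The behavior described in the Figure 3 caption can be illustrated with a toy computation. This is a sketch under stated assumptions, not the paper's trained model: the logits are hand-constructed (a hypothetical value `scale` on positions congruent to `k` modulo `p`, zero elsewhere), positions are 0-indexed, and the function names are made up for illustration. It shows that as the logit gap grows, softmax attention approaches uniform attention over the matching positions.

```python
import math


def modp_attention(n, p, k, scale):
    """Softmax over hand-built logits: `scale` where i % p == k % p, else 0."""
    logits = [scale if i % p == k % p else 0.0 for i in range(n)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_on_residue(n, p, k):
    """Uniform weights over positions i with i % p == k % p (hard attention)."""
    hits = [i for i in range(n) if i % p == k % p]
    return [1.0 / len(hits) if i % p == k % p else 0.0 for i in range(n)]


if __name__ == "__main__":
    soft = modp_attention(n=12, p=3, k=1, scale=20.0)
    hard = uniform_on_residue(n=12, p=3, k=1)
    gap = max(abs(a - b) for a, b in zip(soft, hard))
    print(f"max deviation from uniform-on-residue attention: {gap:.2e}")
```

With a small `scale` the weights leak onto non-matching positions; increasing it shrinks the deviation toward zero, which is the hard-attention limit.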

Theorems & Definitions (39)

  • Theorem 4.1
  • proof : Proof sketch
  • Theorem 4.2
  • proof : Proof sketch
  • Definition 5.1: Complexity and positional margin
  • Theorem 5.2
  • proof : Proof sketch
  • Lemma 5.2
  • Definition 5.3
  • Definition A.1
  • ...and 29 more