A Formal Framework for Understanding Length Generalization in Transformers
Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, Michael Hahn
TL;DR
The paper asks why transformers sometimes succeed and sometimes fail at length generalization, that is, at processing sequences longer than those seen during training. It introduces an idealized inference framework and two mathematical formalisms, Limit Transformers and C-RASP, to analyze length generalization rigorously, proving that any ground-truth function expressible by a Limit Transformer with Periodic and Local positional use will length-generalize once the training inputs are sufficiently long. Experiments across algorithmic tasks and formal languages support the theory’s predictive power: length generalization correlates with C-RASP expressiveness, and certain tasks are inherently non-generalizable under the proposed constraints. Overall, the work connects empirical observations with formal guarantees, offering a principled path toward predicting and understanding length generalization in transformers.
Abstract
A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks. Our theory not only explains a broad set of empirical observations but also opens the way to provably predicting length generalization capabilities in transformers.
