Transformers Can Achieve Length Generalization But Not Robustly
Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou
TL;DR
This work probes why Transformers struggle to generalize to longer sequences and demonstrates that careful alignment of data formatting and position encoding can enable substantial length generalization for decimal addition, achieving 2.5× extrapolation to 100 digits when trained on 40-digit sequences. The authors propose a recipe combining FIRE position encoding, randomized position encoding, a reversed arithmetic format, and index hints, and show that this setup yields strong out-of-distribution generalization but with high sensitivity to random seeds and data ordering. A central insight is that data-architecture synergy, not mere model scale, governs long-range extrapolation, though robustness across seeds remains elusive. The results advance our understanding of how to engineer length generalization in Transformers for arithmetic tasks and point to avenues for making such generalization more reliable in practice.
Abstract
Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's ability to length-generalize using the task of adding two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile, significantly influenced by factors like random weight initialization and training data order, leading to large variances across different random seeds.
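To make the data-format side of the recipe concrete, the following is a minimal sketch of how an addition example might be serialized with a reversed format and index hints. The summary above names these two ingredients but does not spell out the layout, so everything specific here is an assumption for illustration: the function name `format_addition`, the use of lowercase letters as hint tokens, the zero-padding of operands to a common width, and the absence of spaces around `+` and `=`. Position encodings (FIRE, randomized positions) are not modeled.

```python
import string


def format_addition(a: int, b: int, reverse: bool = True, index_hints: bool = True) -> str:
    """Serialize a + b as one training string (hypothetical layout, not the paper's exact format)."""
    # Assumption: pad both operands to the same width, and the sum to one
    # extra digit so a final carry always has a slot.
    width = max(len(str(a)), len(str(b)))
    if index_hints and width + 1 > len(string.ascii_lowercase):
        # This sketch only has 26 letter hints; longer numbers would need a larger hint vocabulary.
        raise ValueError("too many digits for the letter-based hints in this sketch")
    left = str(a).zfill(width)
    right = str(b).zfill(width)
    answer = str(a + b).zfill(width + 1)

    def encode(digits: str) -> str:
        # Reversed format: emit the least significant digit first, so digits
        # appear in the order in which carries propagate.
        if reverse:
            digits = digits[::-1]
        # Index hints: prefix each digit with a marker shared across the
        # operands and the answer, so matching place values carry the same hint.
        if index_hints:
            return "".join(hint + d for hint, d in zip(string.ascii_lowercase, digits))
        return digits

    return f"{encode(left)}+{encode(right)}={encode(answer)}"


print(format_addition(42, 39))                                    # a2b4+a9b3=a1b8c0
print(format_addition(42, 39, reverse=False, index_hints=False))  # 42+39=081
```

In the first output, the hint `a` marks the ones digit of both operands and of the sum, `b` the tens digit, and `c` the carry slot, which is one plausible way to let the model align digits by hint rather than by absolute position.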
