Transformers Can Achieve Length Generalization But Not Robustly

Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou

TL;DR

This work examines why Transformers struggle to generalize to longer sequences and shows that carefully matching the data format to the position encoding enables substantial length generalization for decimal addition: models trained on additions of up to 40 digits extrapolate to 100 digits, a 2.5× length extension. The authors propose a recipe combining FIRE position encoding, randomized position encoding, a reversed arithmetic format, and index hints, and show that this setup yields strong out-of-distribution generalization but is highly sensitive to random seeds and training data order. A central insight is that data-architecture synergy, rather than model scale alone, governs long-range extrapolation, though robustness across seeds remains elusive. The results advance our understanding of how to engineer length generalization in Transformers for arithmetic tasks and point to avenues for making such generalization more reliable in practice.
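To make the data-format part of the recipe concrete, here is a minimal Python sketch of a reversed format with index hints. It rests on assumptions not confirmed by this summary: the function names are hypothetical, hints are taken to be lowercase letters placed before each digit, and operands are left unpadded. The paper's exact tokenization may differ.

```python
import string


def reverse_digits(n: int) -> str:
    """Return the decimal digits of n, least significant digit first."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return str(n)[::-1]


def add_index_hints(digits: str, hints: str = string.ascii_lowercase) -> str:
    """Prefix each digit with a positional hint character (assumed scheme)."""
    if len(digits) > len(hints):
        # A real setup needs a hint vocabulary at least as long as the
        # longest number; 26 letters is only enough for this illustration.
        raise ValueError("not enough hint symbols for this many digits")
    return "".join(hint + digit for hint, digit in zip(hints, digits))


def format_addition(a: int, b: int) -> str:
    """Format 'a+b=sum' in reversed order with index hints on every number."""
    parts = [add_index_hints(reverse_digits(x)) for x in (a, b, a + b)]
    return f"{parts[0]}+{parts[1]}={parts[2]}"


print(format_addition(576, 361))
```

Because the digits are reversed, the same hint letter always marks the same place value in both operands and in the answer, so the hint indicates which digits belong together at each step.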

Abstract

Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's capacity for length generalization using the task of adding two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile: it is significantly influenced by factors such as random weight initialization and training data order, leading to large variance across random seeds.

Paper Structure (56 sections, 2 equations, 43 figures, 1 table)

This paper contains 56 sections, 2 equations, 43 figures, 1 table.

Figures (43)

  • Figure 1: Using an appropriate position encoding and data formatting, we demonstrate that Transformers can generalize to 100-digit decimal addition tasks with more than 98% accuracy when trained on up to 40-digit addition, resulting in a length extension ratio of $2.5\times$, which is much higher than the ratios of lee2023teaching ($1.0\times$), kazemnejad2023impact ($1.125\times$), shen2023positional ($1.1\times$), and zhou2023algorithms ($1.5\times$). Unfilled markers ($\triangledown$) denote in-distribution test results; filled markers ($\blacktriangledown$) denote out-of-distribution results. In zhou2023algorithms and Our Work, each curve is the best out of 10 trials. For the other three methods, we report the values from their corresponding papers.
  • Figure 2: Comparative overview of PEs and data formats: While most related studies focus on APE or NoPE, our approach integrates FIRE li2023functional and Randomized PE ruoss2023randomized. All studies utilize a reversed format. shen2023positional enhance this with random space augmentation, and both zhou2023algorithms and Our Work incorporate index hints.
  • Figure 3: EM accuracy (best of 10 trials) of models trained exclusively on sequences of lengths 1 to 40: the best trials involving FIRE exhibit near-perfect generalization on 100-digit addition.
  • Figure 4: EM accuracy of models trained with and without index hints (best of 10 trials): Without index hints, all PE methods fail in generalization, both within and beyond trained lengths.
  • Figure 5: EM accuracy of the standard vs. the reversed format: Consistent with prior studies, the reversed format outperforms the standard format across all PEs.
  • ...and 38 more figures
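
The captions above rely on three quantities: exact-match (EM) accuracy, a best-of-trials summary, and the length extension ratio. The sketch below shows one plausible way to compute them. The function names are hypothetical, and the predictions and per-trial accuracies in the usage lines are made-up toy values, not results from the paper.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that equal their target string exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one example")
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)


def best_of_trials(per_trial_accuracy: Sequence[float]) -> float:
    """Summarize several seeds by their best accuracy (assumed reading of 'best of 10 trials')."""
    return max(per_trial_accuracy)


def length_extension_ratio(max_test_length: int, max_train_length: int) -> float:
    """Longest length generalized to, divided by the longest training length."""
    return max_test_length / max_train_length


# Toy values for illustration only.
print(exact_match_accuracy(["937", "104"], ["937", "105"]))  # one of two matches
print(best_of_trials([0.10, 0.75, 0.40]))
print(length_extension_ratio(100, 40))  # 100-digit test, 40-digit training
```

Reporting only the best trial hides the seed-to-seed variance that the abstract emphasizes, so a robustness analysis would also look at the full distribution of per-trial accuracies.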