From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers

Shaoxiong Duan, Yining Shi, Wei Xu

TL;DR

This work probes the length generalization of Transformer models on arithmetic tasks, revealing that correct attention patterns are crucial for extrapolation. It introduces Attention Bias Scaffolding (ABS) to steer attention via windowed biases and Cyclic Position Indexing, yielding complete generalization on multiple tasks, including Parity. Building on this, Attention Bias Calibration (ABC) automatically derives extrapolatable attention biases from interpolation data and retrains with these biases, achieving near-perfect to perfect performance up to 50 digits and showing a close relationship to Relative Position Encoding. Collectively, the approach demonstrates a scalable path to extend Transformer capabilities beyond train-time lengths and suggests broader applicability to more complex tasks and NLP domains.
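As a rough illustration of the two ABS ingredients named above, the following sketch builds a windowed additive attention bias and cyclic position indices. The exact formulations are not given in this summary, so the window rule (`|i - j| <= window`), the wrap-around rule (`i mod period`), and all function names here are assumptions made for illustration only.

```python
import math

def cyclic_positions(seq_len, period):
    # Assumed form of cyclic indexing: position i is mapped to i mod period.
    return [i % period for i in range(seq_len)]

def windowed_bias(n, window):
    # Additive bias for an n x n attention matrix (assumed form):
    # 0 inside the window |i - j| <= window, -inf outside it.
    return [[0.0 if abs(i - j) <= window else -math.inf for j in range(n)]
            for i in range(n)]

def softmax(row):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, bias):
    # Add the bias to the raw scores, then normalize each query row.
    return [softmax([s + b for s, b in zip(s_row, b_row)])
            for s_row, b_row in zip(scores, bias)]

# Uniform raw scores, so the result shows the effect of the bias alone.
scores = [[0.0] * 6 for _ in range(6)]
weights = biased_attention_weights(scores, windowed_bias(6, window=1))
positions = cyclic_positions(7, period=3)
```

With uniform scores, each query ends up attending only to keys inside its window, and the position indices repeat with the chosen period regardless of sequence length.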

Abstract

In this paper, we investigate the inherent capabilities of Transformer models in learning arithmetic algorithms, such as addition and parity. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that Transformer models are able to generalize to long lengths with the help of targeted attention biasing. In particular, our solution solves the Parity task, a well-known and theoretically proven failure mode for Transformers. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we show to be connected to mechanisms in relative position encoding (RPE). We demonstrate that using ABC, the Transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks. In addition, we show that ABC bears remarkable similarities to RPE and LoRA, which may indicate the potential for applications to more complex tasks.
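One way to see why a relative-position-style bias can extend beyond the training length: if the bias depends only on the offset between query and key positions, it is constant along each diagonal, so the same table of offsets can fill a matrix of any size. The sketch below shows this general property only; it is not the ABC procedure itself, and the offset table, the default value for unseen offsets, and the function name are hypothetical.

```python
def relative_bias(n, offset_bias, default=0.0):
    # Bias entry (i, j) depends only on the offset i - j, so every
    # diagonal of the matrix holds a single repeated value.
    # `default` for offsets missing from the table is an assumption.
    return [[offset_bias.get(i - j, default) for j in range(n)]
            for i in range(n)]

# Hypothetical offsets, as if estimated from short training sequences.
offset_bias = {0: 1.0, 1: 0.5, -1: 0.5}

short_bias = relative_bias(4, offset_bias)  # training-length matrix
long_bias = relative_bias(8, offset_bias)   # same table, longer input
```

The 4 x 4 matrix reappears as the top-left block of the 8 x 8 one, which is the sense in which a diagonal-constant bias carries over from short to long inputs.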

Paper Structure (23 sections, 13 equations, 11 figures, 3 tables, 1 algorithm)

Figures (11)

  • Figure 1: Extrapolation results for models trained on $L_\mathit{int} \leq 6$ on Successor and Addition. Length is measured in the number of digits of one operand.
  • Figure 2: Attention heat maps for Successor (Left) and Addition (Right).
  • Figure 3: Examples of the different diagonals ABC can take.
  • Figure 4: ABC cross attention bias for Addition.
  • Figure 5: ABC cross attention bias for $N \times 1$.
  • ...and 6 more figures