Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count

Hanseul Cho, Jaeyoung Cha, Srinadh Bhojanapalli, Chulhee Yun

TL;DR

This work tackles length generalization in Transformer-based arithmetic by combining scratchpad reasoning with Position Coupling, so that the model attends to a fixed number of tokens at each next-token prediction step. It reports substantial gains on two challenging tasks: multi-operand addition with varying operand counts and lengths, and integer multiplication with variable operand lengths, reaching up to 30 operands of 30 digits in addition and 20-digit by 15-digit multiplication with notable accuracy. A constructive theorem shows that a 1-layer Transformer with a scratchpad can solve multi-operand addition for operand lengths and operand counts that are exponential in the embedding dimension, which gives theoretical grounding for the empirical results. The findings suggest that structured scratchpad formats and Abacus-style position embeddings can extend the practical reach of arithmetic Transformers and guide future work on length generalization in tasks with well-defined structure.

Abstract

Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (requiring generalization over both the number of operands and their lengths) and multiplication (requiring generalization over both operand lengths). In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers. We design task-specific scratchpads that let the model focus on a fixed number of tokens at each next-token prediction step, and apply multi-level versions of Position Coupling (Cho et al., 2024; McLeish et al., 2024) to tell Transformers the right position to attend to. On the theory side, we prove that a 1-layer Transformer using our method can solve multi-operand addition, up to operand length and operand count that are exponential in the embedding dimension.
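To make the idea of multi-level position IDs concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's exact scheme: the function name `bilevel_position_ids`, the choice of one level for digit significance and one level for operand index, and the omission of separator tokens are all hypothetical simplifications. It only shows how digits of equal significance in zero-padded operands can share one ID while a second ID distinguishes operands.

```python
def bilevel_position_ids(operands, width):
    """Assign two position IDs to every digit token (hypothetical scheme).

    Level 1 (digit_ids): significance of the digit, 1 = ones digit, so
    digits of equal significance share an ID across operands.
    Level 2 (operand_ids): 1-based index of the operand the digit belongs to.
    Separator tokens such as '+' and '=' are left out for brevity.
    """
    tokens, digit_ids, operand_ids = [], [], []
    for k, value in enumerate(operands, start=1):
        padded = str(value).zfill(width)  # zero-pad to a common width
        for i, ch in enumerate(padded):
            tokens.append(ch)
            digit_ids.append(width - i)   # leftmost digit is most significant
            operand_ids.append(k)
    return tokens, digit_ids, operand_ids


tokens, digit_ids, operand_ids = bilevel_position_ids([57, 48], width=3)
print(tokens)       # ['0', '5', '7', '0', '4', '8']
print(digit_ids)    # [3, 2, 1, 3, 2, 1]
print(operand_ids)  # [1, 1, 1, 2, 2, 2]
```

In this sketch the ones digits `7` and `8` both receive digit ID 1, which is the kind of shared ID that could help a model locate the tokens relevant to the digit it is currently producing.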

Paper Structure

This paper contains 49 sections, 2 theorems, 25 equations, 20 figures, 26 tables.

Key Result

Theorem 4.1

With a proper input format, scratchpad, and Position Coupling, there exists a 1-layer 4-head decoder-only Transformer that solves the multi-operand integer addition task involving up to $m$ operands each with up to $n$ digits. Here, a sufficient choice of the embedding dimension is $d = {\mathcal{O}

Figures (20)

  • Figure 1: Unlocking Length Generalization on Multi-Operand Addition Task. We present median exact-match accuracies for 6-layer 8-head decoder-only Transformers trained on multi-operand additions of 2--10 operands, each having 1--10 digits (red boxes represent the scope of trained lengths). We compare three state-of-the-art position embedding (PE) methods for length generalization: NoPE (Kazemnejad et al., 2023), FIRE (Li et al., 2024), and Position Coupling (Cho et al., 2024; McLeish et al., 2024). With a proper scratchpad enabling Transformers to do extrinsic multi-step reasoning (described in the section on multi-operand addition), all three PE methods can extend their generalization scope (blue area of heatmaps). Remarkably, with our proposed multi-level Position Coupling with scratchpad, we achieve significant length generalization, superior to all other methods.
  • Figure 2: An illustration of a scratchpad for a parity problem (with a query 0101).
  • Figure 3: Parity task. We report the accuracies only for the answer token (i.e., the token before EOS) (light area: 95% confidence intervals). The gray region indicates the query lengths in our training data. A complete failure is indicated by the accuracy $\simeq$50%: a random guess between 0 and 1.
  • Figure 4: An example input sequence equipped with scratchpad and bi-level Position Coupling. The original example was "57+48+96=201": the query is inside a blue box and the response is inside a red box. All numbers in the scratchpad (i.e., intermediate steps) are reversed; all numbers in the whole sequence are minimally zero-padded to match their length.
  • Figure 5: Comparison of training lengths in the integer addition task. We report exact-match accuracies (median over at least 4 runs) for the addition task. The red box indicates the training distribution: from left to right, the plots show models trained on ${\mathcal{S}}_A(7,7)$, ${\mathcal{S}}_A(10,10)$, and ${\mathcal{S}}_A(13,13)$.
  • ...and 15 more figures
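
The scratchpads mentioned in the captions for Figure 2 (parity, query 0101) and Figure 4 (57+48+96=201) can be sketched in Python as follows. This is a rough illustration, not the paper's tokenization: the function names `parity_scratchpad` and `addition_scratchpad`, the use of running prefix parities and running partial sums as the intermediate steps, and the returned data layout are assumptions. Only the reversal of scratchpad numbers and the minimal zero-padding come from the Figure 4 caption.

```python
def parity_scratchpad(bits):
    """Running parity of a bit string; the last entry is the answer.

    Assumed format: one intermediate parity per query bit.
    """
    steps, parity = [], 0
    for b in bits:
        parity ^= int(b)  # update using only the previous parity and one bit
        steps.append(parity)
    return steps


def addition_scratchpad(operands):
    """Zero-padded query plus reversed running partial sums (assumed format)."""
    total = sum(operands)
    # Minimal common width so that every number in the sequence has equal length.
    width = max(len(str(total)), max(len(str(x)) for x in operands))
    query = "+".join(str(x).zfill(width) for x in operands) + "="
    steps, partial = [], 0
    for x in operands:
        partial += x
        steps.append(str(partial).zfill(width)[::-1])  # reversed digits
    return query, steps


print(parity_scratchpad("0101"))          # [0, 1, 1, 0]
print(addition_scratchpad([57, 48, 96]))  # ('057+048+096=', ['750', '501', '102'])
```

Each step depends only on the previous step and one new piece of the query (one bit, or one operand), which is what keeps the number of tokens needed per prediction fixed as the input grows.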

Theorems & Definitions (2)

  • Theorem 4.1
  • Theorem D.1