Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun

TL;DR

This work tackles length generalization in arithmetic Transformers by introducing position coupling, which embeds explicit task structure into the positional encoding. By assigning the same position ID to semantically related tokens (for addition, digits of the same significance), decoder-only Transformers trained on short sequences (up to 30 digits) generalize to much longer inputs (up to 200 digits) with high exact-match accuracy. On the theory side, the authors construct a 1-layer, 2-head Transformer with coupled positions that solves addition on operands with exponentially many digits, and they prove that a 1-layer Transformer without positional information cannot fully solve the task. The approach also extends to Nx2 multiplication and a 2D Minesweeper task. Overall, position coupling offers a structure-aware inductive bias that substantially improves length extrapolation on algorithmic tasks, and it raises the question of how task structure could be learned rather than hand-crafted.

Abstract

Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more "relevant" tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, our models trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67x of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task.
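
As an illustration of the coupling rule described above (digits of the same significance share a position ID), here is a minimal Python sketch. It is not the authors' implementation. The function name `coupled_position_ids`, the zero-padded operands, the reversed response, and the choice to give both '+' and '=' the starting ID are assumptions made for this example.

```python
from typing import List, Tuple


def coupled_position_ids(a: int, b: int, start: int = 6) -> Tuple[List[str], List[int]]:
    """Return (tokens, position_ids) for the sequence 'a+b=response'.

    Assumptions of this sketch (not taken from the source text):
      * operands are zero-padded to a common width,
      * the response is zero-padded to width + 1 and written least
        significant digit first (reversed format),
      * '+' and '=' both receive the starting ID.
    """
    width = max(len(str(a)), len(str(b)))
    x = str(a).zfill(width)                    # first operand, most significant digit first
    y = str(b).zfill(width)                    # second operand, most significant digit first
    z = str(a + b).zfill(width + 1)[::-1]      # response, least significant digit first

    tokens = list(x) + ["+"] + list(y) + ["="] + list(z)

    # A digit of significance k (k = 0 is the ones digit) gets ID start + 1 + k,
    # so digits of equal significance in x, y, and z share an ID.
    operand_ids = [start + 1 + k for k in range(width - 1, -1, -1)]
    response_ids = [start + 1 + k for k in range(width + 1)]

    ids = operand_ids + [start] + operand_ids + [start] + response_ids
    return tokens, ids


if __name__ == "__main__":
    toks, ids = coupled_position_ids(653, 49, start=6)
    for t, i in zip(toks, ids):
        print(f"{t!r:>4} -> {i}")
```

In this sketch the ones digits of both operands and of the sum all map to the same ID (`start + 1`), the tens digits to `start + 2`, and so on. Since `start` is a free parameter, the same relative pattern can be shifted to any offset.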

Paper Structure

This paper contains 99 sections, 6 theorems, 100 equations, 23 figures, 90 tables.

Key Result

Theorem 5.1

With the input format described in the subsection on position coupling for addition, there exists a depth-1, two-head decoder-only Transformer with coupled positions that solves the addition task with next-token prediction. Here, the operand length is at most $2^{\left\lfloor (d-17)/2 \right\rfloor}-2$, where the embedd
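
To get a feel for the bound $2^{\lfloor (d-17)/2 \rfloor}-2$, the following Python sketch evaluates it for a few values of $d$. It assumes that $d$ denotes the embedding dimension (the sentence above is cut off at that point) and it starts its search at $d = 17$ only because that is the smallest value with a non-negative exponent. The theorem may impose its own lower bound on $d$. Both function names are hypothetical.

```python
def max_operand_length(d: int) -> int:
    """Evaluate the bound 2^floor((d - 17) / 2) - 2 for an integer d >= 17.

    Assumption: d is the embedding dimension of the Transformer.
    """
    if d < 17:
        raise ValueError("this sketch only evaluates the bound for d >= 17")
    return 2 ** ((d - 17) // 2) - 2


def min_dimension_for(length: int) -> int:
    """Smallest d >= 17 whose bound covers operands of the given length."""
    d = 17
    while max_operand_length(d) < length:
        d += 1
    return d


if __name__ == "__main__":
    for d in (21, 33, 65, 129):
        print(d, max_operand_length(d))
    print("d needed for 200-digit operands:", min_dimension_for(200))
```

Because the exponent grows linearly in $d$, each increase of $d$ by 2 roughly doubles the operand length covered by the bound.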

Figures (23)

  • Figure 1: Methods for Length Generalization in the Integer Addition Task. We report exact-match (EM) accuracies (markers: medians over experiments; light area: 95% confidence intervals). We employ the reversed format and zero-padding [lee2024teaching] in the input sequence. With our proposed position coupling, we achieve more than 95% exact-match accuracy for up to 200-digit additions with decoder-only Transformers trained on up to 30-digit additions. For index hinting [zhou2024what], we separately test absolute positional embedding (APE) with a random starting position ID (mimicking the original implementation of [zhou2024what]) and no positional encoding (NoPE) [kazemnejad2023impact] (as tested by [zhou2024transformers]).
  • Figure 2: Position coupling for the decimal integer addition task, displaying $653+49=702$ with appropriate input formats. The starting position ID '6' is an arbitrarily chosen number.
  • Figure 3: Ablation on the trained operand lengths (1-layer 4-head models).
  • Figure 4: Ablation on the number of layers (trained with position coupling).
  • Figure 5: Ablation on the data formats (1-layer 4-head models trained with position coupling).
  • ...and 18 more figures

Theorems & Definitions (7)

  • Theorem 5.1
  • Proposition 5.1
  • Theorem 6.1
  • Theorem E.1
  • Proposition F.0
  • proof
  • Theorem G.1