Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

Hanseul Cho; Jaeyoung Cha; Pranjal Awasthi; Srinadh Bhojanapalli; Anupam Gupta; Chulhee Yun

TL;DR

This work tackles the challenge of length generalization in arithmetic Transformer models by introducing position coupling, which embeds explicit task structure into the positional encoding. By grouping semantically related digits and assigning shared position IDs, decoder-only Transformers trained on short sequences (e.g., up to 30 digits) can generalize to much longer inputs (up to 200 digits) with high exact-match accuracy, and a constructive 1-layer, 2-head Transformer is shown to solve additions with exponentially long lengths. The authors provide theoretical justifications, including a constructive proof for the 1-layer case and an impossibility result without positional information, and demonstrate the approach extends to Nx2 multiplication and a 2D Minesweeper task. Overall, position coupling offers a scalable, structure-aware bias that dramatically improves length extrapolation for algorithmic tasks and suggests avenues for learning task structure without hand-crafting couplings.

Abstract

Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more "relevant" tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, our models trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67x of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task.
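The coupling idea for addition can be illustrated with a minimal sketch. It assumes the zero-padded operands and reversed response mentioned in the Figure 1 caption, and reuses the example $653+49=702$ with starting position ID 6 from the Figure 2 caption. The function name `coupled_position_ids`, the choice to give `+` and `=` the starting ID, and the exact offsets are assumptions made for illustration; the paper's own layout may differ.

```python
from typing import List, Tuple


def coupled_position_ids(a: int, b: int, start: int = 6) -> Tuple[List[str], List[int]]:
    """Tokenize 'a+b=sum' and give digits of equal significance the same position ID.

    Assumptions (illustrative, not taken from the paper):
      * operands are zero-padded to a common width,
      * the response is zero-padded to width + 1 and written least significant digit first,
      * the delimiters '+' and '=' both receive the starting ID `start`.
    """
    width = max(len(str(a)), len(str(b)))
    x = str(a).zfill(width)
    y = str(b).zfill(width)
    z = str(a + b).zfill(width + 1)[::-1]  # reversed response

    tokens: List[str] = []
    ids: List[int] = []

    # Operands are written most significant digit first.
    # A digit of significance k (k = 0 for the ones place) gets ID start + 1 + k.
    for operand, delimiter in ((x, "+"), (y, "=")):
        for i, digit in enumerate(operand):
            tokens.append(digit)
            ids.append(start + 1 + (width - 1 - i))
        tokens.append(delimiter)
        ids.append(start)

    # The response is already reversed, so its index equals its significance.
    for k, digit in enumerate(z):
        tokens.append(digit)
        ids.append(start + 1 + k)

    return tokens, ids


tokens, ids = coupled_position_ids(653, 49, start=6)
for token, position_id in zip(tokens, ids):
    print(token, position_id)
```

In this sketch the ones digits `3` (from 653), `9` (from 049), and `2` (from 702) all share one position ID, which is the coupling described above. Under these assumptions, changing `start` shifts every ID by the same amount without changing which tokens are coupled.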

Paper Structure (99 sections, 6 theorems, 100 equations, 23 figures, 90 tables)

Introduction
Summary of Contributions
Preliminaries
Data Formats
Positional Embeddings/Encodings (PE)
Position Coupling: A Method for Length Generalization
Position Coupling for Decimal Integer Addition Task
Experiments on the Addition Task
Results
Theoretical Analyses on 1-layer Transformers
1-layer Transformer with Coupled Positions can Perform Long Additions
Probing the Attention Patterns in Trained Transformers with Position Coupling
1-layer Transformers Require Positional Information
Applying Position Coupling Beyond Addition Task
Position Coupling for Nx2 Multiplication Tasks
...and 84 more sections

Key Result

Theorem 5.1

With the input format described in the section "Position Coupling for Decimal Integer Addition Task", there exists a depth-1 two-head decoder-only Transformer with coupled positions that solves the addition task with next-token prediction. Here, the operand length is at most $2^{\left\lfloor (d-17)/2 \right\rfloor}-2$, where the embedd...
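To get a feel for how quickly this bound grows, the short sketch below evaluates $2^{\lfloor (d-17)/2 \rfloor}-2$ for a few values of $d$. The statement above is cut off before it defines $d$, so the code treats it only as the integer that appears in the exponent. The function name `max_operand_length` and the guard for $d < 17$ are assumptions added for illustration.

```python
def max_operand_length(d: int) -> int:
    """Evaluate the bound 2**floor((d - 17) / 2) - 2 from Theorem 5.1.

    The guard below is an illustrative assumption: for d < 17 the exponent
    would be negative, so the expression is not evaluated.
    """
    if d < 17:
        raise ValueError("the bound is only evaluated here for d >= 17")
    return 2 ** ((d - 17) // 2) - 2


for d in (21, 33, 65):
    print(d, max_operand_length(d))
```

Each increase of $d$ by 2 doubles the power-of-two term, which is the sense in which the admissible operand length is exponential in $d$.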

Figures (23)

Figure 1: Methods for Length Generalization in the Integer Addition Task. We report exact-match (EM) accuracies (markers: medians over experiments; light area: 95% confidence intervals). We employ the reversed format and zero-paddings [lee2024teaching] into the input sequence. With our proposed position coupling, we achieve more than 95% exact-match accuracy for up to 200-digit additions with decoder-only Transformers trained on up to 30-digit additions. For index hinting [zhou2024what], we separately test absolute positional embedding (APE) with a random starting position ID (mimicking the original implementation by [zhou2024what]) and without positional encoding (NoPE) [kazemnejad2023impact] (as tested by [zhou2024transformers]).
Figure 2: Position coupling for decimal integer addition task, displaying $653+49=702$ with appropriate input formats. The starting position ID '6' is an arbitrarily chosen number.
Figure 3: Ablation on the trained operand lengths (1-layer 4-head models).
Figure 4: Ablation on the number of layers (trained with position coupling).
Figure 5: Ablation on the data formats (1-layer 4-head models trained with position coupling).
...and 18 more figures

Theorems & Definitions (7)

Theorem 5.1
Proposition 5.1
Theorem 6.1
Theorem E.1
Proposition F.0
proof
Theorem G.1
