Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure
Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun
TL;DR
This work tackles length generalization in arithmetic Transformers by introducing position coupling, which embeds explicit task structure into the positional encoding. By grouping semantically related tokens (for addition, digits of the same significance) and assigning them shared position IDs, decoder-only Transformers trained on short sequences (up to 30 digits) generalize to much longer inputs (up to 200 digits) with high exact-match accuracy. On the theory side, the authors give a constructive proof that a 1-layer, 2-head Transformer with coupled positions can solve addition involving exponentially many digits, together with an impossibility result for 1-layer Transformers without positional information. They also show that the approach extends to Nx2 multiplication and a 2D Minesweeper task. Overall, position coupling offers a scalable, structure-aware bias that substantially improves length extrapolation on algorithmic tasks, and it suggests a direction for future work: learning task structure without hand-crafted couplings.
Abstract
Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to sequences longer than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the task into the positional encoding of a (decoder-only) Transformer. Departing from the vanilla absolute position mechanism, which assigns a unique position ID to each token, we assign the same position ID to two or more "relevant" tokens; for integer addition, we treat digits of the same significance as being in the same position. On the empirical side, we show that with position coupling, models trained on 1- to 30-digit additions can generalize to additions of up to 200 digits (6.67x the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task.
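To make the coupling idea concrete, the following is a minimal Python sketch of how shared position IDs could be assigned for an addition query. It is an illustration under stated assumptions, not the authors' implementation: the function name `coupled_position_ids`, the `start` offset, the most-significant-first digit order, and the choice of ID 0 for the `+` and `=` tokens are all hypothetical. The only property taken from the abstract is that digits of the same significance receive the same position ID.

```python
def coupled_position_ids(a: int, b: int, start: int = 1):
    """Return (tokens, position_ids) for the character sequence 'a+b=c'.

    Hypothetical scheme for illustration: the units digit of every number
    gets ID `start`, the tens digit `start + 1`, and so on, so digits of
    equal significance share one ID. Operator tokens get the placeholder ID 0.
    """
    def digits_with_ids(n: int):
        s = str(n)
        # Index 0 is the most significant digit, so significance is len(s) - 1 - i.
        return list(s), [start + (len(s) - 1 - i) for i in range(len(s))]

    tokens, ids = [], []
    for part in (a, "+", b, "=", a + b):
        if isinstance(part, str):
            # Operator token: placeholder ID (an assumption of this sketch).
            tokens.append(part)
            ids.append(0)
        else:
            part_tokens, part_ids = digits_with_ids(part)
            tokens += part_tokens
            ids += part_ids
    return tokens, ids


tokens, ids = coupled_position_ids(653, 49)
for tok, pid in zip(tokens, ids):
    print(tok, pid)
```

In this sketch the units digits 3, 9, and 2 of `653+49=702` all receive the same ID, as do the tens digits 5, 4, and 0. Varying `start` shifts every digit ID by the same amount while preserving the coupling, which is one way a model could be exposed during training to IDs larger than the training lengths alone would produce; whether and how to do this is a design choice outside what the abstract specifies.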
