Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks

Mahdi Sabbaghi, George Pappas, Hamed Hassani, Surbhi Goel

TL;DR

This work addresses the failure of Transformers to generalize arithmetic tasks to longer inputs, which it attributes to a mismatch between the structure of numbers and standard positional encodings. It introduces explicit structural biases through Relative Positional Encoding (RPE) for addition and Uniform Positional Encoding (UPE) for multiplication, combined with a data format that preserves digit-level structure. This enables generalization from short sequences (e.g., $5$ digits) to long sequences (e.g., $50$ digits) without extra long-sequence data. The authors provide theoretical results in a population regime showing that absolute positional encoding (APE) fails to generalize while RPE generalizes, and they show that data augmentation alone is insufficient. They also demonstrate that increasing training complexity improves long-range generalization. Extending to data that mixes text and numbers, they show the approach applies beyond pure arithmetic, and they discuss how tuning the training distribution to include higher-complexity examples further improves robustness, offering practical guidance for length-generalizable architectures in structured tasks.

Abstract

Despite the success of Transformers on language understanding, code generation, and logical reasoning, they still fail to generalize over length on basic arithmetic tasks such as addition and multiplication. A major reason behind this failure is the vast difference in structure between numbers and text. For example, numbers are typically parsed from right to left, and there is a correspondence between digits at the same position across different numbers. In contrast, for text, such symmetries are quite unnatural. In this work, we propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings. Empirically, our method allows a Transformer trained on numbers with at most 5 digits for addition and multiplication to generalize up to 50-digit numbers, without using additional data for longer sequences. We further demonstrate that traditional absolute positional encodings (APE) fail to generalize to longer sequences, even when trained with augmented data that captures task symmetries. To elucidate the importance of explicitly encoding structure, we prove that explicit incorporation of structure via positional encodings is necessary for out-of-distribution generalization. Finally, we pinpoint other challenges inherent to length generalization beyond capturing symmetries, in particular the complexity of the underlying task, and propose changes in the training distribution to address them.
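
As a rough illustration of why a relative encoding can carry the positional symmetry the abstract describes, the sketch below compares absolute indices with pairwise offsets for the same span of tokens placed at two different locations. It is not the authors' implementation; the helper names `absolute_positions` and `relative_offsets` are hypothetical and exist only for this example.

```python
def absolute_positions(start, length):
    """Absolute index of each token when the span begins at `start` (hypothetical helper)."""
    return list(range(start, start + length))


def relative_offsets(positions):
    """Pairwise offsets j - i, the only quantity a relative encoding would see."""
    return [[j - i for j in positions] for i in positions]


# The same 4-token span placed at two different locations in a sequence.
span_a = absolute_positions(start=0, length=4)
span_b = absolute_positions(start=7, length=4)

# Absolute indices change when the span is shifted ...
print(span_a == span_b)  # False

# ... while the pairwise offsets are identical.
print(relative_offsets(span_a) == relative_offsets(span_b))  # True
```

Under this simplified view, a model that conditions only on offsets treats a shifted span the same way as the original, whereas a model that conditions on absolute indices encounters positions it never saw during training.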

Paper Structure

This paper contains 45 sections, 1 theorem, 13 equations, 13 figures, 1 table, 2 algorithms.

Key Result

Proposition 5.1

For the seq-to-seq regression task, the transformer model after training with GD with infinitesimal weight-decay on the positional vectors in the population regime:

Figures (13)

  • Figure 2: (a) Accuracy of the model using absolute positional encoding (APE) as the length of the sequence increases. (b) The model with RPE trained on up to 5-digit sums and tested on up to 50-digit sums. Although accuracy fluctuates considerably, the relative-position inductive bias makes generalization possible. (c) The model with APE when trained with augmented data, i.e., all shifted versions of samples from the dataset are given in training.
  • Figure 3:
  • Figure 4: (a) Accuracy of a BERT model with APE for single-digit $\times$ multi-digit multiplication when trained only up to 5-digit multiplicands. (b) Same setting as in (a) but with RPE. (c) Same setting as (a) but with our proposed UPE. Using the uniform symmetry naturally gives advantages over RPE.
  • Figure 5: (a) Accuracy of a BERT model with RPE for 3-digit $\times$ multi-digit multiplication when trained only up to 5-digit multiplicands. (b) Same setting as in (a) but with our new positional encoding (UPE). The difference between UPE and RPE becomes more apparent as the length of the multiplier increases. (c) The model using APE with the privilege of augmented data. (d) The model with RPE, trained with augmented data. Both the APE and RPE models are outperformed by our UPE.
  • Figure 6: Extra attention heads utilized with pairwise positional encodings are integrated into the model to exploit structures across various tasks. The diagram illustrates how certain attention heads employ relative position encodings to enable length generalization for the addition task.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Proposition 5.1: informal