Looped Transformers for Length Generalization

Ying Fan; Yilun Du; Kannan Ramchandran; Kangwook Lee

TL;DR

The paper tackles length generalization on algorithmic tasks with looped Transformers that run an adaptive number of iterations, targeting $n$-RASP-L problems. A single decoder block is applied repeatedly and only the final output is supervised, so the model learns a reusable step that extends to longer inputs once the iteration count is raised at inference. Experiments on Parity, Copy, Addition, and other tasks show better length generalization than fixed-depth and standard next-token prediction (NTP) baselines, and the proposed stopping rules select steps with accuracy close to 1. The approach needs no intermediate-step supervision, and it motivates further work on looped architectures and step-aware training for length-flexible reasoning.

Abstract

Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
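The central idea in the abstract, one operation reused for a number of steps that grows with the input length, can be illustrated with a small sketch. Everything named here is an assumption for illustration: `looped_apply`, `toy_step`, and `solve` are hypothetical, and the toy step (one pass of adjacent swaps) is only a stand-in for a learned block, not the authors' model.

```python
from typing import Callable, List


def looped_apply(step: Callable[[List[int]], List[int]],
                 state: List[int],
                 num_steps: int) -> List[int]:
    """Apply the same step function num_steps times (shared 'weights')."""
    for _ in range(num_steps):
        state = step(state)
    return state


def toy_step(xs: List[int]) -> List[int]:
    """Hypothetical stand-in for one block: a single pass of adjacent swaps."""
    out = list(xs)
    for i in range(len(out) - 1):
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def solve(xs: List[int]) -> List[int]:
    """Choose the step count from the input length, here T(n) = n."""
    return looped_apply(toy_step, xs, num_steps=len(xs))


if __name__ == "__main__":
    print(solve([3, 1, 2]))
    print(solve(list(range(20, 0, -1))))
```

The step itself never changes with the input length; only the number of repetitions does, which is the property the looped approach relies on.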

Paper Structure (50 sections, 3 theorems, 5 equations, 9 figures, 2 tables)

Introduction
Background
RASP-L
Next-token prediction and full-output prediction
$n$-RASP-L
Learning $n$-RASP-L problems with looped Transformers
End-to-end supervised data without intermediate step supervision
Looped training with step supervision
Architecture of the looped Transformers
Positional embeddings:
Training algorithm
Adaptive inference
Related work
Positional embedding for length generalization.
RNNs and Chomsky Hierarchy.
...and 35 more sections

Key Result

Proposition 3.2

(Parity.) There exists an $n$-RASP-L program with $T(n) = n$ that solves the $n$-bit parity check task, where $y$ is the parity check result for an arbitrary binary input sequence $\{x_i\}$.
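A rough way to see why $T(n) = n$ steps suffice is to write parity as $n$ repetitions of one length-independent step. The sketch below follows the "shifting and XOR" description in the Figure 3 caption, but the concrete state layout (a bit list plus an accumulator) and the names `parity_step` and `looped_parity` are assumptions made for illustration, not the program from the paper.

```python
from typing import List, Tuple


def parity_step(bits: List[int], acc: int) -> Tuple[List[int], int]:
    """One iteration: XOR the leading bit into acc, then shift left by one."""
    acc ^= bits[0]
    shifted = bits[1:] + [0]  # pad with 0 so the sequence length stays fixed
    return shifted, acc


def looped_parity(bits: List[int]) -> int:
    """Run the same step n times, where n is the input length."""
    state, acc = list(bits), 0
    for _ in range(len(bits)):
        state, acc = parity_step(state, acc)
    return acc


if __name__ == "__main__":
    for example in ([1, 0, 1, 1], [1, 1], [1] * 101):
        print(len(example), looped_parity(example))
```

Because each step consumes exactly one bit, an input of length $n$ needs $n$ iterations, and a longer input is handled by running more iterations of the unchanged step.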

Figures (9)

Figure 1: Method Overview. During training, we supervise the output of the model to match the target data only after a certain number of steps of applying the same decoder block, helping the model learn intermediate steps that can be reused and can handle input of arbitrary lengths. All grey blocks share the same parameters. Examples are from the Copy task with $n$ symbols. "#" indicates EOS, "*" indicates ignored output, and ">" indicates the end of the query (EOQ).
Figure 2: Visualization of the next-token prediction (NTP) and full-output prediction (FOP) schemes. "#" indicates EOS, "*" indicates ignored output, and ">" indicates the end of the query (EOQ).
Figure 3: Visualization of the $n$-RASP-L solutions for Copy, Parity, and Addition with $n=2$. Copy is implemented by $n$ iterations of shifting; Parity is implemented by $n$ iterations of shifting and XOR; Addition is implemented by $n+1$ iterations of shifted XOR and AND. The inputs are preprocessed. See the $n$-RASP-L section for details.
Figure 4: Length Generalization Performance. Our looped Transformer model with adaptive depth generalized better than NTP methods across studied tasks, including the variants with pause tokens and weight-tied layers. The vertical dashed line indicates the maximum training length.
Figure 5: Stopping criterion visualizations. Plot of the stopping criterion on the test set. The vertical line indicates the step chosen by the step-selection equation (where $B=N_{\text{test}}$ equals the size of the test set) within the step range shown in the plots. The accuracy is on the full test set. The chosen steps have accuracy $\approx1$ across tasks.
...and 4 more figures
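The stopping criterion plotted in Figure 5 is defined by an equation that this summary does not reproduce. As a hedged illustration only, the sketch below assumes a confidence-based rule: for each candidate step, score the outputs by their cross-entropy against their own argmax, averaged over the batch, and pick the step with the lowest score. The function names and the toy probabilities are hypothetical, and the rule itself is an assumption rather than the paper's confirmed criterion.

```python
import math
from typing import List, Tuple

Distribution = List[float]


def self_cross_entropy(probs: List[Distribution]) -> float:
    """Mean of -log(max probability): low when outputs are confident."""
    return -sum(math.log(max(p)) for p in probs) / len(probs)


def choose_step(per_step_probs: List[List[Distribution]]) -> Tuple[int, List[float]]:
    """Return the 1-indexed step with the lowest score, plus all scores.

    per_step_probs[t] holds the output distributions (over a batch of
    tokens) produced after t + 1 applications of the looped block.
    """
    scores = [self_cross_entropy(p) for p in per_step_probs]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores


if __name__ == "__main__":
    # Toy distributions: step 2 is the most confident one.
    toy = [
        [[0.6, 0.4], [0.5, 0.5]],
        [[0.99, 0.01], [0.98, 0.02]],
        [[0.7, 0.3], [0.9, 0.1]],
    ]
    step, scores = choose_step(toy)
    print(step, [round(s, 3) for s in scores])
```

A rule of this kind needs no labels at inference, since it scores each step only by how confident the model's own outputs are.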

Theorems & Definitions (7)

Definition 3.1: $n$-RASP-L
Proposition 3.2
proof
Proposition 3.3
proof
Proposition 3.4
proof
