Looped Transformers for Length Generalization
Ying Fan, Yilun Du, Kannan Ramchandran, Kangwook Lee
TL;DR
The paper addresses length generalization in algorithmic tasks by introducing Looped Transformers with an adaptive number of iterations. It targets n-RASP-L problems: tasks with a known iterative solution built from repeated applications of a length-generalizable RASP-L operation. A single decoder block is trained across multiple iterations with supervision on the final output only, so the model learns step-dependent strategies that carry over to longer inputs when the iteration count is adjusted at inference. Experiments on Parity, Copy, Addition, and other tasks show substantial improvements over fixed-depth models and standard next-token prediction (NTP) baselines, and the adaptive stopping rules are effective at choosing the number of iterations. The approach enables adaptive computation for length generalization without intermediate-step supervision. It may inform the design of more robust, length-flexible reasoning systems and motivates further work on looped architectures and step-aware training.
Abstract
Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
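To make the idea of "multiple iterations of a length-generalizable operation" concrete, here is a minimal pure-Python sketch using parity. It is an illustration only, not the paper's model or training algorithm: the names `step` and `looped_parity` are hypothetical, the step is hand-written rather than learned, and the number of steps is assumed to be supplied by the caller.

```python
from typing import List, Tuple

# Hypothetical state for illustration: (input bits, read position, running parity).
State = Tuple[List[int], int, int]


def step(state: State) -> State:
    """One fixed update that does not depend on the input length:
    fold the next unread bit (if any) into the running parity."""
    bits, pos, parity = state
    if pos < len(bits):
        parity ^= bits[pos]
        pos += 1
    return bits, pos, parity


def looped_parity(bits: List[int], num_steps: int) -> int:
    """Apply the same step `num_steps` times and read off the final parity.
    This mirrors reusing one block for a variable number of iterations."""
    state: State = (bits, 0, 0)
    for _ in range(num_steps):
        state = step(state)
    return state[2]


if __name__ == "__main__":
    short = [1, 0, 1, 1]
    long = [1] * 7
    # Same step function, more iterations for the longer input.
    print(looped_parity(short, num_steps=len(short)))
    print(looped_parity(long, num_steps=len(long)))
```

The point of the sketch is that the step itself never changes with input length; only the number of times it is applied does. Running too few steps leaves part of the input unread, which is why the iteration count has to adapt to the input at inference.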
