Universal Length Generalization with Turing Programs

Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, Eran Malach

TL;DR

The paper tackles length generalization in transformer models on algorithmic tasks. It introduces Turing Programs, a universal scratchpad strategy inspired by Turing Machines, and shows that combining them with Hard-ALiBi positional encoding enables robust length extrapolation on addition, multiplication, and in-context SGD. The authors prove that transformers can implement Turing Programs by constructing a RASP program that simulates a Turing machine, and they show broader applicability through experiments on random Turing machine simulations. The work suggests a pathway toward length generalization for a wide class of algorithms, with implications for reasoning and program execution in large language models.

Abstract

Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed some architecture or data format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we propose Turing Programs, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing Machine. This framework is both universal, as it can accommodate any algorithmic task, and simple, requiring only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. Finally, we theoretically prove that transformers can implement Turing Programs, constructing a simple RASP (Weiss et al.) program that simulates an arbitrary Turing machine.
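
To make the idea of "copying text from the context with small modifications" concrete, here is a minimal Python sketch that unrolls a toy Turing machine into scratchpad lines, one line per configuration. The function name `turing_program`, the `[state]` head marker, and the bit-flipping machine are illustrative assumptions, not the paper's actual data format.

```python
from typing import Dict, List, Tuple

# (state, read_symbol) -> (next_state, written_symbol, head_move)
Transition = Dict[Tuple[str, str], Tuple[str, str, int]]


def turing_program(tape: str, delta: Transition, start: str = "q0",
                   halt: str = "halt", max_steps: int = 1000) -> List[str]:
    """Return one scratchpad line per machine configuration (hypothetical format)."""
    cells, head, state = list(tape), 0, start
    lines = []
    for _ in range(max_steps):
        # Copy the whole tape; the state marker sits just before the head cell.
        lines.append("".join(cells[:head]) + f"[{state}]" + "".join(cells[head:]))
        if state == halt:
            break
        next_state, written, move = delta[(state, cells[head])]
        cells[head] = written  # the only cell that changes between lines
        head += move
        state = next_state
    return lines


# Toy machine (an assumption for illustration): flip every bit, halt at the blank "_".
FLIP_BITS: Transition = {
    ("q0", "0"): ("q0", "1", 1),
    ("q0", "1"): ("q0", "0", 1),
    ("q0", "_"): ("halt", "_", 0),
}

scratchpad = turing_program("101_", FLIP_BITS)
print("\n".join(scratchpad))
```

Each line differs from the previous one only near the head, so producing the next line amounts to copying the previous line and editing a small local region. The sketch assumes the head never leaves the initial tape.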

Paper Structure (51 sections, 1 theorem, 2 equations, 8 figures, 1 table)

Key Result

Theorem 4.1

Let $T$ be a Turing Machine such that 1) $T$ does not generate repeated $n$-grams and 2) $T$ operates in-memory. Then there exists a RASP program $P$ of size (number of lines) $O(n)$ such that, for every input $x$ without repeated $n$-grams, $P$ correctly simulates $T$ for $\exp(n)$ steps.
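
The theorem's precondition on repeated $n$-grams can be checked mechanically. The sketch below is a minimal illustration, with `has_repeated_ngram` as a hypothetical helper name. One plausible reading, stated here as an assumption, is that the condition matters because finding what to copy by matching the preceding $n$ tokens becomes ambiguous when the same $n$-gram occurs twice.

```python
from typing import Hashable, Sequence


def has_repeated_ngram(tokens: Sequence[Hashable], n: int) -> bool:
    """Return True if some contiguous n-gram occurs at least twice in tokens."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


# "ab" occurs twice in "abcab", but every 3-gram in it is unique.
print(has_repeated_ngram("abcab", 2))
print(has_repeated_ngram("abcab", 3))
```

Increasing `n` makes the condition easier to satisfy, at the cost of a longer match window.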

Figures (8)

  • Figure 1: Turing Program example for simulating a Turing Machine with scratchpad.
  • Figure 2: Turing Program for addition. Text in comments is not part of the input.
  • Figure 3: (a): Comparison of different positional encodings and data formats for length generalization on addition. We see significant extrapolation to longer sequence lengths with Hard-ALiBi and scratchpad. In this figure and in the rest of the paper, the shaded region shows the $95\%$ confidence intervals. (b): Hard-ALiBi with scratchpad, trained with 5 different initialization seeds. While there is significant variability across training runs, results are more robust than in prior work.
  • Figure 4: Turing Program for 3-digit multiplication. At each step, we update the head position, the result of the "local" multiplication, the carry, and the intermediate result of the "global" multiplication.
  • Figure 5: Length generalization on multiplication by 1 and 3 digit numbers.
  • ...and 3 more figures
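
The figures above refer to Hard-ALiBi without defining it. As a rough sketch, assuming the usual description of Hard-ALiBi as a hard local attention window applied per head (with some heads left unrestricted), the additive attention bias for one head could be built as follows. The function name `hard_alibi_bias` and the window sizes are hypothetical, and the paper's exact configuration is not given in this summary.

```python
import math
from typing import List, Optional


def hard_alibi_bias(seq_len: int, window: Optional[int] = None) -> List[List[float]]:
    """Additive attention bias for one head.

    Position i may attend to position j only if j <= i (causal) and
    j > i - window. window=None leaves the head with the causal mask only.
    """
    bias = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = window is None or j > i - window
            row.append(0.0 if causal and in_window else -math.inf)
        bias.append(row)
    return bias


# A head with a window of 2 sees only the current and the previous token.
for row in hard_alibi_bias(4, window=2):
    print(row)
```

The bias is added to the attention logits before the softmax, so entries of `-inf` receive zero attention weight.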

Theorems & Definitions (2)

  • Definition 4.1
  • Theorem 4.1