
Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers

Alireza Amiri, Xinting Huang, Mark Rofin, Michael Hahn

TL;DR

This work establishes unconditional lower bounds on the length of chain-of-thought (CoT) reasoning in hard-attention (UHAT) transformers across several algorithmic tasks, proving that CoTs must grow at least linearly with input size in many cases. Using random-restriction methods and a depth-reduction argument, the authors show that sublinear CoTs would force constant outputs on large input subsets, which is incompatible with parity-like functions. They apply these generic bounds to concrete problems (Parity, Multiplication, Median, and Reachability), demonstrating $\Omega(N)$ or $\Omega(N \log N)$ CoT requirements, and show that these bounds are tight up to polylog factors; they also analyze the limitations of dot-by-dot CoTs. Complementing theory with experiments on synthetic transformers and pretrained LLMs, the paper argues that sublinear CoTs are unlikely to suffice for the investigated tasks, motivating tool use or architectural innovations for efficient reasoning. Overall, the results delineate the power and limits of CoT reasoning in transformers and quantify the inference-time compute necessary for robust algorithmic reasoning.
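
The incompatibility between restrictions and parity-like functions can be checked by brute force on small inputs. The sketch below is only an illustration, not the paper's construction: the helper names `restrict` and `is_constant` and the example restriction `rho` are hypothetical. It fixes some input bits, leaves the others free, and tests whether the restricted function is constant. Parity stays non-constant as long as one bit is free, whereas a function such as AND collapses once a single bit is fixed to 0.

```python
from itertools import product


def parity(bits):
    """Return 1 if the number of ones in `bits` is odd, else 0."""
    return sum(bits) % 2


def restrict(f, rho):
    """Apply a restriction `rho` to `f`.

    `rho` is a list whose entries are 0, 1 (fixed bits) or None (free bits).
    Returns the restricted function over the free bits and their count.
    """
    free = [i for i, v in enumerate(rho) if v is None]

    def g(free_bits):
        x = list(rho)
        for i, b in zip(free, free_bits):
            x[i] = b
        return f(x)

    return g, len(free)


def is_constant(g, n_free):
    """Brute-force check: does `g` output one value on all 2**n_free inputs?"""
    outputs = {g(bits) for bits in product((0, 1), repeat=n_free)}
    return len(outputs) == 1


# Hypothetical restriction: five bits fixed, three left free.
rho = [1, None, 0, None, 1, 1, None, 0]
parity_restricted, n_free = restrict(parity, rho)
parity_is_constant = is_constant(parity_restricted, n_free)

# For contrast, AND becomes constant as soon as one bit is fixed to 0.
and_restricted, k = restrict(lambda x: int(all(x)), [0, None, None])
and_is_constant = is_constant(and_restricted, k)
```

Flipping any single free bit flips the parity, so no restriction that leaves a bit free can make Parity constant; this is the property the lower-bound argument exploits.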

Abstract

Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. While theoretical results show that polynomial-length scratchpads can extend transformers' expressivity from $TC^0$ to $PTIME$, their required length remains poorly understood. Empirical evidence suggests that transformers need scratchpads even for many problems in $TC^0$, such as Parity or Multiplication, challenging optimistic bounds derived from circuit complexity. In this work, we initiate the study of systematic lower bounds for the number of chain-of-thought steps across different algorithmic problems, in the hard-attention regime. We study a variety of algorithmic problems and provide bounds that are tight up to logarithmic factors. Overall, these results contribute to an emerging understanding of the power and limitations of chain-of-thought reasoning.

Paper Structure

This paper contains 72 sections, 19 theorems, 40 equations, 20 figures, 1 table.

Key Result

Theorem 3.3

Assume that $f$ has a UHAT-expressible CoT $g(x)$ of length $|g(x)| = o\left(|x|\right)$. Choose any $C \in (0,1)$. Then there is a restriction $\rho$ such that

Figures (20)

  • Figure 1: High-sensitivity problems such as Parity are difficult for transformers to represent in a learnable and generalizable way. Such problems can be solved by CoTs that decompose them into a sequence of local steps. We study the length that such CoTs need to have, as a function of the input length.
  • Figure 2: Results for Parity. Left: CoTs consisting of prefix sum parities, accuracy as a function of input and CoT lengths. CoTs of length $\Theta(N)$ succeed even on very long inputs. Right: Dot-by-dot scratchpads help much less, even when much longer than the input (Theorem \ref{theorem:dot-by-dot}).
  • Figure 3: Accuracy of a transformer trained on autoregressive 12-digit multiplication (most significant digit at the left, zero-padded on the left), on a test set with 10K number pairs. High-sensitivity digits in the middle show substantially decreased accuracy even if digits at beginning and end are predicted exactly. See Figure \ref{fig:multiplication-accuracy-by-digit} for more.
  • Figure 4: Results for DAG reachability. Without a CoT, the model performs at chance level for all but very small graphs. An $\mathcal{O}(|E| \log |V|)$-length CoT leads to perfect accuracy independent of the graph size.
  • Figure 5: Results for Median: CoTs consisting of the sorted first $\lfloor N/2 \rfloor$ numbers. Accuracy as a function of input and CoT lengths.
  • ...and 15 more figures
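
The prefix-sum-parity CoT mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration under assumptions: the function name `prefix_parity_cot` and the exact output format are hypothetical and need not match the token format used in the paper's experiments. Each CoT step depends only on the previous step and one input bit, and the CoT has exactly $N$ steps, i.e. length $\Theta(N)$.

```python
def prefix_parity_cot(bits):
    """Return the chain of prefix parities for a list of 0/1 input bits.

    Step i holds the parity of bits[0..i], so every step is a local update
    (previous parity XOR current bit) and the last step is the final answer.
    """
    cot = []
    running = 0
    for b in bits:
        running ^= b  # local step: combine previous parity with one new bit
        cot.append(running)
    return cot


example_input = [1, 0, 1, 1]
example_cot = prefix_parity_cot(example_input)
# example_cot is [1, 1, 0, 1]; its last entry is the parity of the input.
```

The design point is that no single step needs to aggregate information from the whole input: the high-sensitivity computation is replaced by $N$ low-sensitivity ones.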

Theorems & Definitions (45)

  • Definition 2.1
  • Definition 2.2
  • Definition 3.2
  • Theorem 3.3: Generic CoT Bound
  • Lemma 3.4
  • Definition 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Definition 4.4
  • Theorem 4.5
  • ...and 35 more