Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

Parsa Mirtaheri, Ezra Edelman, Samy Jelassi, Eran Malach, Enric Boix-Adsera

TL;DR

The paper studies how to allocate inference-time compute for reasoning in large language models, contrasting sequential scaling (a longer chain of thought, CoT) with parallel scaling (aggregating many short CoTs). It introduces a graph-connectivity reasoning task and two graph families that theoretically separate the two regimes: under standard complexity assumptions, a sequential CoT of polynomial length can solve the task, whereas parallel aggregation over short CoTs cannot. A Vertex Query Model is developed to analyze the underlying tradeoffs. The theory is supported by experiments across several models and by reinforcement learning runs in which longer CoTs emerge. The findings imply that, for certain reasoning tasks, sequential scaling is exponentially more cost-effective than parallel scaling, which informs the design of test-time reasoning strategies and suggests that real-world problems may call for a mix of sequential and parallel methods.

Abstract

Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.

Paper Structure

This paper contains 43 sections, 10 theorems, 13 equations, 14 figures.

Key Result

Theorem 1

Assume the complexity-theoretic statement that $\mathsf{TC}^{0} \not\supseteq \mathsf{L}$. Then the following is true for bounded-depth, limited-precision transformers.

Figures (14)

  • Figure 1: Model accuracy with different combinations of parallel and sequential scaling on a graph reasoning task. The sequential scale is the length budget for the chain of thought, and the parallel scale is the number of independent chains of thought generated. The left figure reports the performance of a small transformer model trained for this task, where aggregation is with best-of-$n$. The right figure shows the performance of the frontier DeepSeek-R1-Distill-Qwen-32B reasoning model r1, where aggregation is with majority vote. In both cases, there is a regime in which large increases to parallel scaling are required to compensate for a small decrease in sequential scaling; more details in Section \ref{sec:experiments}.
  • Figure 2: Left: an example "two-path" graph task from Theorem \ref{thm:vqm-path-task}. Right: the task input is a list of edges in randomized order and with randomly permuted vertex IDs.
  • Figure 3: Left: an example "Bridge" graph task from Definition \ref{def:bridge-graph}. Top right: the task input is a list of edges in randomized order and with randomly permuted vertex IDs. Bottom right: examples of chain of thought strategies used to train the model in our experiments in Section \ref{sec:experiments}.
  • Figure 4: (left) Evidence accuracy of CoT strategies on Bridge tasks of various depths, compared to the probabilities of a DFS trace becoming the shortest path and a path, respectively. (right) Evidence accuracy of CoT strategies with different sequential CoT budgets on the Bridge(5) task. Model outputs are sampled with greedy decoding. Error bars represent 95% binomial confidence intervals.
  • Figure 5: (left) Majority decision and (right) Best-of-$n$ accuracy for parallel scaling of models trained with CoT strategies for Bridge(3) task, compared to the accuracy predicted by each CoT independently meeting the evidence criteria with the probability of a DFS trace becoming the shortest path and a path respectively.
  • ...and 9 more figures
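
The two parallel aggregation rules named in the captions above, majority vote and best-of-$n$, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the `is_valid` verifier, and the binary-answer simplification in `majority_accuracy` are assumptions introduced here. The accuracy formulas assume each chain succeeds independently with the same probability `p`, in the spirit of the "predicted accuracy" baseline described for Figure 5.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most common final answer among independent chains.

    Ties are broken in favor of the answer that appeared first.
    """
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, is_valid):
    """Return the first answer accepted by a verifier, or None if none passes.

    `is_valid` is a hypothetical checker (for example, one that verifies the
    evidence produced by a chain of thought).
    """
    for answer in answers:
        if is_valid(answer):
            return answer
    return None


def best_of_n_accuracy(p, n):
    """Probability that at least one of n independent chains succeeds."""
    return 1.0 - (1.0 - p) ** n


def majority_accuracy(p, n):
    """Probability that more than half of n independent chains are correct.

    Assumes a binary task, so every incorrect chain votes for the same
    wrong answer; use odd n to avoid ties.
    """
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )
```

Under these assumptions, a per-chain success probability of `p = 0.5` on a binary task leaves `majority_accuracy` at chance for any odd `n`, while `best_of_n_accuracy` still grows with `n`, which is one reason the choice of aggregation rule matters when comparing parallel budgets.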

Theorems & Definitions (22)

  • Definition 1: $(s,t)$-connectivity problem
  • Definition 2: $(s,t_1,t_2)$-connectivity problem
  • Theorem 1: Informal statement of \ref{thm:expressivity-separation}
  • Definition 3: Vertex Query Model
  • Theorem 2: Minimum number of VQM queries needed for graph connectivity
  • Definition 4: Bridge Graph
  • Theorem 3
  • Definition 5: $\mathsf{TC}^{0}$ computational model
  • Proposition 1: Transformers are in $\mathsf{TC}^{0}$; implied by Theorem 14 of chiang2024transformers
  • Lemma 1: Approximating the autoregressive distribution of a transformer
  • ...and 12 more
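
To make the task format concrete, here is a small sketch of a connectivity instance presented as a shuffled edge list with permuted vertex IDs, as described in the captions of Figures 2 and 3. The exact "two-path" construction is not given in this summary, so the generator below is an assumption: it builds two disjoint paths, places $s$ at the start of one, and asks which of $t_1, t_2$ (the far ends of the two paths) is connected to $s$. The names `make_two_path_task` and `connected_target` are hypothetical.

```python
import random
from collections import deque


def make_two_path_task(path_len, seed=0):
    """Build a hypothetical two-path instance (assumed construction).

    Returns (edges, s, t1, t2, answer), where `answer` is whichever of
    t1, t2 lies on the same path as s.
    """
    rng = random.Random(seed)
    n = 2 * path_len
    # Path A uses vertices 0..path_len-1, path B uses path_len..n-1.
    edges = [(i, i + 1) for i in range(path_len - 1)]
    edges += [(i, i + 1) for i in range(path_len, n - 1)]
    s, end_a, end_b = 0, path_len - 1, n - 1

    # Randomly permute vertex IDs, then shuffle the edge order.
    perm = list(range(n))
    rng.shuffle(perm)
    edges = [(perm[u], perm[v]) for u, v in edges]
    rng.shuffle(edges)

    # Randomize which target is listed first.
    targets = [perm[end_a], perm[end_b]]
    rng.shuffle(targets)
    return edges, perm[s], targets[0], targets[1], perm[end_a]


def connected_target(edges, s, t1, t2):
    """Return whichever of t1, t2 is reachable from s (breadth-first search)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    if t1 in seen:
        return t1
    if t2 in seen:
        return t2
    return None
```

The breadth-first search here only serves as a reference solver for checking generated instances; it says nothing about how a transformer with a bounded chain-of-thought budget would solve the task.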