To CoT or To Loop? A Formal Comparison Between Chain-of-Thought and Looped Transformers

Kevin Xu, Issei Sato

TL;DR

This work formalizes a comparison between Chain-of-Thought (CoT) prompting and Looped Transformers, two approaches to enhancing reasoning in Transformer-based models. It shows that Looped Transformers inherently support parallel solutions for deterministic computations modeled as directed acyclic graphs, needing a number of loops that scales with graph depth, while CoT with stochastic decoding excels at probabilistic inference for compositional problems by exploiting self-reducibility and sampling. The paper proves polylogarithmic-time separations between the two paradigms under carefully defined problem settings, and gives concrete task instances illustrating the Looped Transformer advantage, as well as the probabilistic CoT advantage in approximate counting and sampling. The findings provide practical guidance on when to deploy depth-driven recursion versus step-by-step CoT reasoning, and suggest extending the analysis to practical FLOPs and space limitations.

Abstract

Chain-of-Thought (CoT) and Looped Transformers have been shown to empirically improve performance on reasoning tasks and to theoretically enhance expressivity by recursively increasing the number of computational steps. However, their comparative capabilities are still not well understood. In this paper, we provide a formal analysis of their respective strengths and limitations. We show that Looped Transformers can efficiently simulate parallel computations for deterministic tasks, which we formalize as evaluation over directed acyclic graphs. In contrast, CoT with stochastic decoding excels at approximate inference for compositional structures, namely self-reducible problems. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical cues for choosing between reasoning paradigms.
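
To make "evaluation over directed acyclic graphs" concrete, the following is a minimal Python sketch, not the paper's Transformer construction. It evaluates a small, hypothetical Boolean computation graph in two ways: layer by layer, where the number of passes equals the graph depth (loosely analogous to the loop count of a Looped Transformer), and node by node in topological order, where the number of steps equals the number of non-input nodes (loosely analogous to CoT steps). The graph `GRAPH`, the operation table `OPS`, and all function names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy graph: node -> (operation, parent nodes).
# Nodes with operation None are inputs and read their value from `inputs`.
GRAPH = {
    "x1": (None, []),
    "x2": (None, []),
    "x3": (None, []),
    "x4": (None, []),
    "a": ("and", ["x1", "x2"]),
    "b": ("or", ["x3", "x4"]),
    "out": ("xor", ["a", "b"]),
}

OPS = {
    "and": lambda u, v: u & v,
    "or": lambda u, v: u | v,
    "xor": lambda u, v: u ^ v,
}


def node_depths(graph):
    """Depth of each node: 0 for inputs, else 1 + the maximum parent depth."""
    depth = {}

    def visit(node):
        if node not in depth:
            _, parents = graph[node]
            depth[node] = 1 + max(visit(p) for p in parents) if parents else 0
        return depth[node]

    for node in graph:
        visit(node)
    return depth


def eval_layerwise(graph, inputs):
    """Evaluate one depth level per pass; returns (values, number of passes)."""
    depth = node_depths(graph)
    layers = defaultdict(list)
    for node, d in depth.items():
        layers[d].append(node)
    values = dict(inputs)
    passes = 0
    for d in range(1, max(depth.values()) + 1):
        # Nodes in the same layer read only values from earlier layers,
        # so this inner loop could run in parallel.
        for node in layers[d]:
            op, parents = graph[node]
            values[node] = OPS[op](*(values[p] for p in parents))
        passes += 1
    return values, passes


def eval_sequential(graph, inputs):
    """Evaluate one node per step in topological order; returns (values, steps)."""
    depth = node_depths(graph)
    # Sorting by depth is a valid topological order: parents are strictly shallower.
    order = sorted(graph, key=lambda n: depth[n])
    values = dict(inputs)
    steps = 0
    for node in order:
        op, parents = graph[node]
        if op is None:
            continue  # input nodes are already in `values`
        values[node] = OPS[op](*(values[p] for p in parents))
        steps += 1
    return values, steps


if __name__ == "__main__":
    inputs = {"x1": 1, "x2": 1, "x3": 0, "x4": 0}
    print(eval_layerwise(GRAPH, inputs))
    print(eval_sequential(GRAPH, inputs))
```

On this toy graph the layer-wise evaluation needs 2 passes (the depth), while the sequential evaluation needs 3 steps (the number of gates). The gap widens for graphs that are wide but shallow.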

Paper Structure

This paper contains 28 sections, 13 theorems, 10 equations, 2 figures, 1 table.

Key Result

Theorem 4.4

Let $\{G_n\}_{n \in \mathbb{N}}$ be a family of computation graphs, each computing a function $G_n : \Sigma^n \to \Sigma^{m(n)}$, and satisfying Assumptions ass:0 and ass:1. Then, for each $n \in \mathbb{N}$, there exists a log-precision CoT with parameter size bounded by $\mathsf{poly}(n)$, such th

Figures (2)

  • Figure 1: Comparison of simulation strategies for computing a DAG. (a) A computation graph. (b) A Looped TF simulates the graph in a layer-wise and parallel fashion, requiring a number of loops proportional to the depth of the graph. (c) A CoT simulates the computation sequentially in topological order, requiring a number of steps proportional to the size of the graph.
  • Figure 2: Illustration of next-token prediction in CoT as approximate relative counting over a self-reducible relation. Given a SAT instance $x$, self-reducibility enables reduction to a subproblem $x'$ based on the autoregressive prefix (e.g., $y_1 = 1$). The probability $\pi(y_2 = 1)$ is approximated via relative counting, with self-consistency aggregating predictions from multiple reasoning paths.
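
The idea in Figure 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's construction: fixing a prefix of the assignment reduces the SAT instance to a subproblem, and the next bit is sampled with probability equal to the fraction of satisfying assignments that extend the prefix with that bit. Brute-force exact counting stands in here for the approximate relative counting the figure describes. The clause encoding and the names `count_extensions` and `sample_solution` are hypothetical.

```python
import itertools
import random


def count_extensions(clauses, n_vars, prefix):
    """Count satisfying assignments that extend `prefix` (brute force).

    `clauses` is a CNF formula: a literal k > 0 means variable k is true,
    and k < 0 means variable |k| is false. Variables are numbered from 1.
    """
    free = n_vars - len(prefix)
    total = 0
    for tail in itertools.product((0, 1), repeat=free):
        assignment = tuple(prefix) + tail
        satisfied = all(
            any((assignment[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
            for clause in clauses
        )
        total += satisfied
    return total


def sample_solution(clauses, n_vars, rng):
    """Sample a satisfying assignment one bit at a time.

    At each position, the probability of emitting 1 is the relative count
    of solutions extending the current prefix with 1. Returns None if the
    formula has no satisfying assignment.
    """
    prefix = []
    for _ in range(n_vars):
        ones = count_extensions(clauses, n_vars, prefix + [1])
        zeros = count_extensions(clauses, n_vars, prefix + [0])
        if ones + zeros == 0:
            return None
        p_one = ones / (ones + zeros)
        prefix.append(1 if rng.random() < p_one else 0)
    return tuple(prefix)


if __name__ == "__main__":
    # Hypothetical instance: (x1 OR x2) AND (NOT x1 OR x3)
    clauses = [[1, 2], [-1, 3]]
    rng = random.Random(0)
    print(count_extensions(clauses, 3, []))
    print(sample_solution(clauses, 3, rng))
```

Because each bit is drawn from the exact conditional distribution, the sampled assignment is uniform over all satisfying assignments. With approximate counts in place of exact ones, the same loop yields an approximately uniform sampler.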

Theorems & Definitions (26)

  • Definition 2.1: Informal: CoT
  • Definition 2.2: Informal: Looped TF
  • Definition 3.1: Deterministic Problem
  • Definition 3.2: Probabilistic Problem
  • Definition 4.1: Computation Graph
  • Theorem 4.4: CoT for DAGs
  • Theorem 4.7: Looped TF for DAGs
  • Definition 4.8: $\mathsf{CoT}$ and $\mathsf{LOOP}$
  • Definition 4.9: $\mathsf{TC}^k$ and $\mathsf{NC}^k$
  • Theorem 4.10: li2024chain
  • ...and 16 more