A Formal Comparison Between Chain-of-Thought and Latent Thought

Kevin Xu, Issei Sato

TL;DR

The paper formalizes and contrasts two Transformer-based reasoning paradigms: Chain-of-Thought (CoT), which generates explicit intermediate steps token by token, and Latent Thought in Looped Transformers (Looped TFs), which iteratively refine latent representations without decoding them. In the deterministic setting, it proves a separation at polylogarithmic time: a Looped TF can evaluate a computation graph in a number of loops proportional to the graph's depth, because every node that is ready can be processed in parallel within a loop, whereas CoT requires a number of steps proportional to the graph's size. This links Looped TFs to circuit classes such as AC^k and TC^k. In the stochastic setting, the authors show that CoT with stochastic decoding can realize randomized approximation schemes (FPAUS/FPRAS) for self-reducible problems, while Looped TFs cannot under standard complexity separations (FPTAS ⊊ FPRAS). Experiments on parallelizable tasks and a DNF-counting approximation task corroborate the theory and illustrate the practical trade-off: Latent Thought excels at parallel computation with modest loop counts, while CoT uses stochastic sampling to tackle hard combinatorial problems. The work thus offers a principled guide for choosing a reasoning paradigm according to whether the problem favors parallel latent-space computation or probabilistic approximation via reasoning steps, with implications for designing future hierarchical or hardware-accelerated reasoning systems.

Abstract

Chain-of-Thought (CoT) elicits reasoning in large language models by explicitly generating intermediate steps in natural language. In contrast, Latent Thought in looped models operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that Latent Thought in Looped Transformers enables parallel computation, which is more efficient than the inherently sequential process of CoT. In contrast, CoT leverages stochastic decoding to approximate solutions to problems where exact computation is intractable. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms. Code is available at https://github.com/kevin671/cot-vs-loop.
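
To make the stochastic side of the comparison concrete, the sketch below shows what an FPRAS-style estimator looks like on DNF counting, the approximation task named in the TL;DR. It uses the classical Karp-Luby sampling estimator, which is a standard randomized scheme for this problem; it is not taken from the paper's repository and is not claimed to be the procedure the paper's CoT model learns. The clause encoding, function names, and example formula are all illustrative assumptions.

```python
import random
from typing import Dict, List

# Assumed encoding: a clause maps variable index -> required truth value,
# e.g. {0: True, 2: False} means (x0 AND NOT x2).
Clause = Dict[int, bool]


def clause_satisfied(clause: Clause, assignment: List[bool]) -> bool:
    return all(assignment[var] == val for var, val in clause.items())


def exact_dnf_count(clauses: List[Clause], n_vars: int) -> int:
    """Brute-force count of satisfying assignments (exponential in n_vars)."""
    count = 0
    for bits in range(2 ** n_vars):
        assignment = [(bits >> i) & 1 == 1 for i in range(n_vars)]
        if any(clause_satisfied(c, assignment) for c in clauses):
            count += 1
    return count


def karp_luby_estimate(clauses: List[Clause], n_vars: int,
                       n_samples: int, rng: random.Random) -> float:
    """Karp-Luby estimator; assumes a non-empty list of clauses."""
    # Clause i is satisfied by exactly 2^(n_vars - |clause i|) assignments.
    weights = [2 ** (n_vars - len(c)) for c in clauses]
    total = sum(weights)
    hits = 0
    for _ in range(n_samples):
        # Pick a clause proportionally to its number of satisfying assignments,
        # then a uniformly random assignment that satisfies it.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        assignment = [rng.random() < 0.5 for _ in range(n_vars)]
        for var, val in clauses[i].items():
            assignment[var] = val
        # Count the sample only if i is the first clause this assignment
        # satisfies, so each satisfying assignment is counted exactly once.
        first = next(j for j, c in enumerate(clauses)
                     if clause_satisfied(c, assignment))
        if first == i:
            hits += 1
    return total * hits / n_samples


# Illustrative formula: (x0 AND x1) OR (NOT x1 AND x2) OR (x2 AND NOT x3)
example_clauses: List[Clause] = [
    {0: True, 1: True},
    {1: False, 2: True},
    {2: True, 3: False},
]
exact = exact_dnf_count(example_clauses, n_vars=4)
estimate = karp_luby_estimate(example_clauses, n_vars=4,
                              n_samples=20000, rng=random.Random(0))
print(exact, estimate)
```

The estimator needs randomness at every sample, which is the resource that stochastic decoding supplies to CoT; the brute-force counter is included only to check the estimate on a small instance.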

Paper Structure

This paper contains 74 sections, 20 theorems, 116 equations, 5 figures, 2 tables.

Key Result

Theorem 3.5

Let $\{G_n\}_{n \in \mathbb{N}}$ be a family of computation graphs that satisfy Assumptions ass:-1 and ass:0. Then, for each $n \in \mathbb{N}$, there exists a log-precision CoT with parameter size bounded by $O(\mathrm{ff\_param}(G_n))$, such that for every input $x \in \Sigma^n$, the model outputs

Figures (5)

  • Figure 1: Comparison of strategies for evaluating a DAG. (a) A computation graph $G_n$. (b) A Looped TF simulates the graph in a number of loops proportional to the depth of the graph, $O(\mathrm{depth}(G_n))$. (c) A CoT simulates it in a number of steps proportional to the size of the graph, $O(\mathrm{size}(G_n))$.
  • Figure 2: Accuracy of $r$-looped TFs on the arithmetic evaluation task (left) and the connectivity task (right). Each curve shows the performance for a fixed loop count $r$ as the input size $n$ increases. Increasing the loop count improves accuracy, and a logarithmic number of loops suffices.
  • Figure 3: Relative error on the DNF counting task.
  • Figure 4: Overview of the results. Left: deterministic setting, illustrating the relationship between Looped TFs with $\log^{k-1}(n)$ and $\log^k(n)$ loops, CoT with $\log^k(n)$ steps, and the corresponding circuit classes ($\mathsf{TC}^k$, $\mathsf{AC}^k$, $\mathsf{TC}^{k-1}$), all under the assumptions of $\log(n)$-precision and $\mathsf{poly}(n)$ size. Right: stochastic setting for a self-reducible relation $R$, where CoT with stochastic decoding realizes FPAUS for the uniform distribution over $R(x)$ or an FPRAS for $|R(x)|$ within $\mathsf{poly}(n)$ steps, whereas Looped TFs with stochastic decoding fail to achieve the same within $\mathsf{poly}(n)$ loops.
  • Figure 5: Illustration of the role of the attention and feedforward layers in Looped TFs for evaluating DAGs: the attention layer uniformly attends to and aggregates all inputs at each position, while the feedforward layer simultaneously simulates the functions of all DAG nodes in parallel.
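
The depth-versus-size contrast described in Figures 1 and 5 can be illustrated with a small sketch. The code below is only an analogy in plain Python, not a Transformer and not the paper's construction: `evaluate_sequential` computes one node per step in topological order (as CoT does), while `evaluate_layered` computes every node whose inputs are ready in the same round (as a loop of a Looped TF does). The graph encoding and all names are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Assumed encoding: a node is (function, names of the nodes or inputs it reads).
Node = Tuple[Callable[..., int], List[str]]


def evaluate_sequential(graph: Dict[str, Node], inputs: Dict[str, int],
                        order: List[str]) -> Tuple[Dict[str, int], int]:
    """One node per step, in a given topological order."""
    values = dict(inputs)
    steps = 0
    for name in order:
        fn, preds = graph[name]
        values[name] = fn(*(values[p] for p in preds))
        steps += 1
    return values, steps


def evaluate_layered(graph: Dict[str, Node],
                     inputs: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """All nodes whose predecessors are available are computed per round."""
    values = dict(inputs)
    pending = set(graph)
    rounds = 0
    while pending:
        ready = [n for n in pending if all(p in values for p in graph[n][1])]
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        # Compute the whole layer from the values of the previous rounds.
        computed = {n: graph[n][0](*(values[p] for p in graph[n][1]))
                    for n in ready}
        values.update(computed)
        pending.difference_update(ready)
        rounds += 1
    return values, rounds


def add(a: int, b: int) -> int:
    return a + b


# Balanced binary tree summing 8 inputs: 7 internal nodes, depth 3.
inputs = {f"x{i}": i + 1 for i in range(8)}
graph: Dict[str, Node] = {
    "a0": (add, ["x0", "x1"]), "a1": (add, ["x2", "x3"]),
    "a2": (add, ["x4", "x5"]), "a3": (add, ["x6", "x7"]),
    "b0": (add, ["a0", "a1"]), "b1": (add, ["a2", "a3"]),
    "c": (add, ["b0", "b1"]),
}
topological_order = ["a0", "a1", "a2", "a3", "b0", "b1", "c"]

seq_values, seq_steps = evaluate_sequential(graph, inputs, topological_order)
par_values, par_rounds = evaluate_layered(graph, inputs)
print(seq_values["c"], seq_steps)   # 36 7  (steps = number of nodes)
print(par_values["c"], par_rounds)  # 36 3  (rounds = depth)
```

For a balanced tree over $n$ inputs the sequential count grows linearly in $n$ while the round count grows as $\log_2 n$, which is the gap the figure captions describe.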

Theorems & Definitions (59)

  • Definition 2.1: CoT
  • Definition 2.2: Looped Transformer
  • Definition 3.1: Computation graph
  • Definition 3.2: merrill2023parallelism
  • Theorem 3.5: CoT for DAGs
  • Proof: Proof sketch
  • Theorem 3.7: Looped TF for DAGs
  • Proof: Proof sketch
  • Definition 3.8: $\mathsf{CoT}$ and $\mathsf{Loop}$
  • Definition 3.9: Informal: Boolean Circuits
  • ...and 49 more