When does Chain-of-Thought Help: A Markovian Perspective

Zihan Wang, Yijun Dong, Qi Lei

TL;DR

This work analyzes when and why CoT helps, quantifies how noise in intermediate steps modulates its benefit, and designs synthetic benchmarks that isolate transition alignment and intermediate-step noise, complementing prior results on real-world tasks.

Abstract

Chain-of-Thought (CoT) prompting is a widely used inference-time technique for improving reasoning, yet its gains are uneven across tasks. We analyze when and why CoT helps by modeling the step-wise reasoning trajectory as a Markov chain. Each intermediate step is a state and the dependence between steps is captured by a transition kernel. Our theory identifies transition alignment, whether instances share a common step-wise transition kernel, as the key determinant of CoT's effectiveness. When transitions are identical across steps, CoT reduces inference-time sample complexity: fewer context sample trajectories suffice to recover the final decision. In contrast, when transitions differ across steps, these gains can vanish. We further quantify how noise in intermediate steps modulates CoT's benefit. Beyond theory, we design synthetic benchmarks that isolate these factors to complement prior results on real-world tasks and to empirically validate our predictions.
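To make the Markovian view concrete, the following minimal sketch composes per-step transition kernels into an end-to-end kernel and compares their margins. It is an illustration under stated assumptions, not the paper's code: the 3-state kernel `P` is hypothetical, the helper names are invented here, and "margin" is assumed to mean the smallest gap between the two largest entries in any row.

```python
def matmul(a, b):
    """Multiply two square row-stochastic matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def compose(kernels):
    """End-to-end kernel Q = P^(1) P^(2) ... P^(T) for a list of per-step kernels."""
    q = kernels[0]
    for p in kernels[1:]:
        q = matmul(q, p)
    return q


def margin(m):
    """Smallest gap between the top two entries over all rows.

    This margin definition is an assumption made for illustration.
    """
    gaps = []
    for row in m:
        top, second = sorted(row, reverse=True)[:2]
        gaps.append(top - second)
    return min(gaps)


# Hypothetical 3-state kernel: each state has one dominant successor (0.7).
P = [[0.7, 0.2, 0.1],
     [0.1, 0.7, 0.2],
     [0.2, 0.1, 0.7]]

Q = compose([P, P])        # aligned case: the same kernel at both steps
local_margin = margin(P)   # gap a step-wise (CoT) learner must resolve
global_margin = margin(Q)  # gap a direct (non-CoT) learner must resolve
```

In this toy example the end-to-end margin is smaller than the per-step margin, which is one intuition for why identifying the final decision directly may need more context samples than identifying each step.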

Paper Structure (35 sections, 12 theorems, 33 equations, 5 figures)

Key Result

Theorem 4.1

Under the initial-distribution assumption and the $Q$-margin assumption, if the number of samples $n$ satisfies the required lower bound, then with probability at least $1-\delta$, direct inference recovers $\hat{j}^*(i)=\arg\max_j Q_{ij}$ for all $i$.
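The estimator $\hat{j}^*(i)$ in this statement can be read as a majority vote over observed end states. The sketch below is one plausible reading, not the authors' implementation: the function names are hypothetical, and the uniform choice of the initial state in `sample_pairs` is an assumption that stands in for the paper's initial-distribution condition.

```python
import random
from collections import Counter


def sample_pairs(q, n, seed=0):
    """Draw n (x_0, x_T) pairs from an end-to-end kernel q (rows sum to 1).

    The start state is drawn uniformly at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    k = len(q)
    pairs = []
    for _ in range(n):
        start = rng.randrange(k)
        end = rng.choices(range(k), weights=q[start])[0]
        pairs.append((start, end))
    return pairs


def direct_inference(pairs, num_states):
    """Estimate argmax_j Q[i][j] for every start state i by majority vote.

    Returns a list whose i-th entry is the most frequent end state observed
    from start state i, or None if start state i never appears in `pairs`.
    """
    counts = [Counter() for _ in range(num_states)]
    for start, end in pairs:
        counts[start][end] += 1
    return [c.most_common(1)[0][0] if c else None for c in counts]
```

With enough samples relative to the margin of $Q$, the majority vote matches the true $\arg\max_j Q_{ij}$ for every $i$ with high probability, which is the kind of guarantee the theorem formalizes.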

Figures (5)

  • Figure 1: (a) Example of Markov-style decomposition. An illustrative instance, where the original text is selected from mavi2024multi, is parsed into an initial state $x_0$ and a sequence of relations/operators $r_{1:T}$. The correct intermediate steps $x_{1:T}$ are also included. The example highlights how text data are collapsed into a formalized sequence, making the intermediate operations explicit for CoT demonstrations. (b) Stepwise Markov dynamics. Each $r_t$ induces a transition kernel $P^{(t)}$ from $x_{t-1}$ to $x_t$ and the next node is generated randomly following the transition. At inference, the ground truth intermediate steps are the nodes with the largest probability $p^*$ for each step.
  • Figure 2: Synthetic alignment study (same vs. diff). Left: Accuracy as a function of the number of context samples $n$ comparing CoT (solid) and NonCoT (dashed). CoT yields a consistently larger improvement under same than under diff. Right: Sample-complexity proxy $\Delta n(\tau)$ (CoT$-$NonCoT) vs. target accuracy $\tau$. More negative values indicate fewer samples required by CoT. The widening gap in the same condition is consistent with the theoretical $1/T$-type gain for aligned kernels. Shaded regions denote mean $\pm$ SE.
  • Figure 3: Noise ablation under aligned transitions. Left: Accuracy vs. $n$ for CoT/NonCoT under small and large intermediate-step noise. Right: $\Delta n(\tau)$ vs. $\tau$. CoT reduces the required context budget more when noise is higher, reflecting that the global margin $\Delta_Q$ shrinks faster than the local margin $\Delta_P$ as stochasticity increases. Shaded regions denote mean $\pm$ SE.
  • Figure 4: Multi-step modular addition (practical check). Accuracy vs. $n$ for CoT and NonCoT when both steps add the same number versus different numbers (fixed modulus). The larger CoT gain in the aligned (same) case corroborates the synthetic findings in a more realistic arithmetic setting. Error bands show mean $\pm$ SE.
  • Figure 5: City–State rankings. Accuracy vs. $n$ (left) and CoT gain $\Delta\text{Acc}$ (right) under same-skill vs. diff-skill. Error bands: mean $\pm$ SE.
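The sample-complexity proxy $\Delta n(\tau)$ used in Figures 2 and 3 can be computed from two accuracy curves. The sketch below assumes one natural definition, since the captions do not spell it out: for each method, take the smallest $n$ on the evaluated grid whose accuracy reaches $\tau$, then subtract the NonCoT value from the CoT value. The function names are hypothetical.

```python
def samples_to_reach(ns, accs, tau):
    """Smallest n in `ns` whose accuracy is at least tau, or None if never reached."""
    for n, acc in sorted(zip(ns, accs)):
        if acc >= tau:
            return n
    return None


def delta_n(ns, acc_cot, acc_noncot, tau):
    """Assumed proxy: n needed by CoT minus n needed by NonCoT at target accuracy tau.

    Negative values mean CoT reaches the target with fewer context samples.
    Returns None when either method never reaches tau on the evaluated grid.
    """
    n_cot = samples_to_reach(ns, acc_cot, tau)
    n_noncot = samples_to_reach(ns, acc_noncot, tau)
    if n_cot is None or n_noncot is None:
        return None
    return n_cot - n_noncot
```

Under this definition the proxy is only as fine-grained as the grid of $n$ values that were evaluated, so interpolating between grid points would be a reasonable refinement.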

Theorems & Definitions (19)

  • Theorem 4.1: Direct inference
  • Theorem 4.2: Homogeneous CoT
  • Theorem 4.3
  • Remark 4.4
  • Lemma 5.1: Chapter 2 of Boucheron, Lugosi, and Massart (2013)
  • Lemma 5.2
  • Remark 5.3
  • Lemma A.1: Multinomial top-1 identification (restatement of the corresponding main-text lemma)
  • proof
  • Theorem B.1: Restatement of Theorem 4.1 (direct inference)
  • ...and 9 more