Unveiling the Mechanisms of Explicit CoT Training: How CoT Enhances Reasoning Generalization

Xinhao Yao, Ruifeng Ren, Yun Liao, Yong Liu

TL;DR

The paper tackles how explicit CoT training improves reasoning generalization by internalizing stepwise reasoning into a two-stage circuit within transformers. It combines controlled synthetic data experiments with logit-lens and causal-tracing analyses to reveal a circuit whose stages match training steps, and it provides information-theoretic generalization bounds showing improved OOD generalization when CoT data cover relevant subtasks. Real-world validation via LoRA-finetuned models on GSM8K corroborates significant performance gains even with noisy CoT data. The findings illuminate mechanisms behind CoT's robustness and offer guidance for designing effective CoT strategies in LLMs.

Abstract

The integration of explicit Chain-of-Thought (CoT) reasoning into training large language models (LLMs) has advanced their reasoning capabilities, yet the mechanisms by which CoT enhances generalization remain poorly understood. This work investigates (1) how CoT training reshapes internal model representations and (2) why it improves both in-distribution (ID) and out-of-distribution (OOD) reasoning generalization. Through controlled experiments and theoretical analysis, we derive the following key insights. 1) Structural Advantage: CoT training internalizes reasoning into a two-stage generalizing circuit, where the number of stages corresponds to the explicit reasoning steps during training. Notably, CoT-trained models resolve intermediate results at shallower layers compared to non-CoT counterparts, freeing up deeper layers to specialize in subsequent reasoning steps. 2) Theoretical Analysis: the information-theoretic generalization bounds via distributional divergence can be decomposed into ID and OOD components. While ID error diminishes with sufficient training regardless of CoT, OOD error critically depends on CoT: non-CoT training fails to generalize to OOD samples due to unseen reasoning patterns, whereas CoT training achieves near-perfect OOD generalization by mastering subtasks and reasoning compositions during training. The identified mechanisms explain our experimental results: CoT training accelerates convergence and enhances generalization from ID to both ID and OOD scenarios while maintaining robust performance even with tolerable noise. These findings are further validated on complex real-world datasets. This paper offers valuable insights for designing CoT strategies to enhance LLM reasoning robustness.
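To make the contrast between CoT and non-CoT training targets concrete, here is a minimal sketch of how a two-hop fact could be serialized in each setting, plus a simple way to corrupt the final answer with a given noise probability. The function names, the token layout, and the toy knowledge base are illustrative assumptions, not the paper's actual data pipeline.

```python
import random

def build_two_hop(one_hop, head, r1, r2):
    """Compose two atomic facts: (head, r1) -> bridge, then (bridge, r2) -> tail."""
    bridge = one_hop[(head, r1)]
    tail = one_hop[(bridge, r2)]
    return bridge, tail

def format_example(head, r1, r2, bridge, tail, with_cot):
    """Return (prompt, target) as token lists (hypothetical layout).

    Without CoT the target is only the final answer.
    With CoT the target spells out the bridge entity before the answer.
    """
    prompt = [head, r1, r2]
    target = [bridge, tail] if with_cot else [tail]
    return prompt, target

def corrupt_second_hop(target, entities, xi, rng):
    """With probability xi, replace the final answer with a random wrong entity."""
    if rng.random() < xi:
        wrong = [e for e in entities if e != target[-1]]
        return target[:-1] + [rng.choice(wrong)]
    return list(target)

# Toy knowledge base (assumed for illustration): e1 -r1-> e2 -r2-> e3.
one_hop = {("e1", "r1"): "e2", ("e2", "r2"): "e3"}
entities = ["e1", "e2", "e3", "e4"]

bridge, tail = build_two_hop(one_hop, "e1", "r1", "r2")
cot_example = format_example("e1", "r1", "r2", bridge, tail, with_cot=True)
plain_example = format_example("e1", "r1", "r2", bridge, tail, with_cot=False)

print(cot_example)    # (['e1', 'r1', 'r2'], ['e2', 'e3'])
print(plain_example)  # (['e1', 'r1', 'r2'], ['e3'])

rng = random.Random(0)
noisy_target = corrupt_second_hop(cot_example[1], entities, xi=1.0, rng=rng)
```

In this sketch the CoT target exposes the intermediate result `e2` as an explicit supervision signal, while the non-CoT target requires the model to resolve it internally. With `xi=1.0` the final answer is always replaced and the bridge entity is left intact, which mimics noise on the second hop only.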

Paper Structure

This paper contains 30 sections, 9 theorems, 36 equations, 6 figures, 4 tables.

Key Result

Theorem 1

Under the conditions specified in Assumption 1 and Definition 2 (expected generalization error), assume that the loss $\ell(w, Z)$ is $R$-subGaussian for any $w\in \mathcal{W}\subseteq \mathbb{R}^d$. Then the expected generalization error admits an upper bound in which $N$ is the training data size, $Z=(X,Y)$, and $\mathcal{W}$ is the space of hypotheses related to the model. $\alpha$ denotes the mixing coefficient o
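For orientation only, the classical information-theoretic bound for $R$-subGaussian losses (Xu and Raginsky, 2017) has the form below. It is not the paper's Theorem 1, whose expression does not survive in this excerpt; the symbols $S$, $L_\mu$, $L_S$, and $I(W;S)$ are introduced here as assumptions for illustration.

$$
\left| \mathbb{E}\big[ L_\mu(W) - L_S(W) \big] \right| \;\le\; \sqrt{\frac{2R^2}{N}\, I(W; S)}
$$

Here $S=(Z_1,\dots,Z_N)$ is an i.i.d. training sample, $L_\mu(W)$ is the population risk, $L_S(W)$ is the empirical risk on $S$, and $I(W;S)$ is the mutual information between the learned hypothesis and the sample. Bounds of this family shrink as $N$ grows, which is consistent with the abstract's statement that ID error diminishes with sufficient training.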

Figures (6)

  • Figure 1: The model generalization ability under controllable data settings (see the section on CoT vs. non-CoT training). Left Part: the accuracy comparison on two-hop reasoning between training without CoT (Left) and training with CoT (Center Left), $\lambda=7.2$. CoT training significantly accelerates convergence and improves generalization from ID to OOD reasoning. Right Part: the impact of the two-hop/one-hop ratio $\lambda$ (Center Right) and model scale (Right) on OOD generalization. The ratio $\lambda$ correlates with OOD generalization speed, while larger models converge more quickly without altering reasoning behavior.
  • Figure 2: The generalizing circuit (8-layer model) for two-hop facts. We analyze individual hidden states and assess the strength of the connections between them. Left: when training without CoT, the circuit emerges only during ID generalization, with the intermediate result $e_2$ resolved at layer index 5. Right: when training with CoT, the model achieves ID/OOD generalization via a two-stage circuit, with the intermediate result $e_2$ resolved at layer index $l$, where $l = 3$ for ID and $l = 5$ for OOD. Intuitively, a smaller layer index leaves more layers available for processing the second hop, potentially leading to better performance.
  • Figure 3: Left Part: the impact of second-hop-only noise on ID (Left) and OOD (Center Left) generalization. Noise significantly affects final performance on both ID and OOD test data, but the generalization trends for ID and OOD differ under noisy conditions. Right Part: the model’s accuracy on training and testing two-hop reasoning facts at different noise ratios (both hops are noisy), shown for $\xi$ values of 0.05 (Center Right) and 0.1 (Right).
  • Figure 4: The two-stage generalizing circuit (2-layer model) for two-hop facts. We use the logit lens to interpret individual states and causal tracing to measure the strength of connections between states. A two-layer model can evidently learn generalizing circuits from CoT training. This aligns with Cabannes et al. (2024), who describe how a particular distribution of weights within the first two attention layers of a transformer, termed an “iteration head,” enables the transformer to solve iterative tasks with CoT reasoning relatively easily. The output of the first stage is fed autoregressively into the second stage.
  • Figure 5: The model's accuracy on training and testing two-hop reasoning facts at different noise ratios (both hops are noisy). This figure compares the results for $\xi$ values of $0.05, 0.1, 0.2$, and $0.4$.
  • ...and 1 more figure
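The captions for Figures 2 and 4 refer to the layer index at which the intermediate result $e_2$ is resolved, read off with the logit lens. The sketch below shows the basic idea under simplifying assumptions: each layer's hidden state is projected through the unembedding matrix, and the first layer whose top-scoring token is the target counts as the resolution layer. The helper names, the 0-based layer indexing, and the hand-built hidden states are hypothetical, and the final layer normalization that a real logit lens usually applies is omitted. Causal tracing is not covered here.

```python
def project(hidden, unembed):
    """Map one hidden state (length d) to vocabulary logits via a d x vocab matrix."""
    vocab_size = len(unembed[0])
    return [
        sum(h * row[v] for h, row in zip(hidden, unembed))
        for v in range(vocab_size)
    ]

def top_token(logits):
    """Index of the highest-scoring vocabulary entry."""
    return max(range(len(logits)), key=logits.__getitem__)

def resolution_layer(hidden_states, unembed, token_id):
    """First (0-based) layer whose logit-lens prediction equals token_id, else None."""
    for layer, hidden in enumerate(hidden_states):
        if top_token(project(hidden, unembed)) == token_id:
            return layer
    return None

# Toy setup (assumed): 4-token vocabulary, identity unembedding, 8 layers.
unembed = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
E2 = 2  # hypothetical token id of the bridge entity e_2

# Layers 0-2 still encode token 0; from layer 3 onward e_2 dominates.
hidden_states = [[1.0, 0.0, 0.0, 0.0]] * 3 + [[0.2, 0.0, 1.0, 0.0]] * 5

per_layer_top = [top_token(project(h, unembed)) for h in hidden_states]
print(per_layer_top)                                  # [0, 0, 0, 2, 2, 2, 2, 2]
print(resolution_layer(hidden_states, unembed, E2))   # 3
```

In a real model the hidden states would come from the residual stream at a chosen token position, and the unembedding matrix would be the model's output projection.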

Theorems & Definitions (21)

  • Definition 1: Data Distribution
  • Theorem 1: Generalization Bounds via Distributional Divergence
  • Remark 1: Theoretical Analysis of ID/OOD Generalization with/without CoT
  • Theorem 2: OOD Generalization Error for Training with CoT
  • Remark 2: Robustness Discussion of Training with CoT
  • Lemma 1: KL Divergence Decomposition
  • proof
  • Definition 2: Expected Generalization Error
  • Remark 3
  • Lemma 2: Donsker and Varadhan’s variational formula
  • ...and 11 more