Rethinking Chain-of-Thought from the Perspective of Self-Training

Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng

TL;DR

The work identifies a core parallel between CoT reasoning and self-training: both iteratively leverage model-generated information to reduce prediction uncertainty. It introduces a two-module CoT framework: a Task-Specific Prompt (TSP) module that generates high-quality initial reasoning, and an Adaptive Reasoning Iteration (ARI) module that refines the reasoning while preventing over-reasoning and encouraging diversity between iterations. Through a theoretical analysis of entropy dynamics and experiments across ten datasets, the method shows substantial gains over zero-shot and self-consistency baselines, with notable improvements on arithmetic tasks and lower computational cost. The approach offers practical guidance for controlling reasoning quality and exploration in LLMs.

Abstract

Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. Interestingly, we observe that CoT reasoning and self-training share a core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes the initial reasoning process, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses the limitations of previous CoT approaches, i.e., over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments demonstrate that the proposed method achieves significant advantages in both performance and computational efficiency.

Paper Structure

This paper contains 25 sections, 8 theorems, 11 equations, 8 figures, 9 tables, 1 algorithm.

Key Result

Lemma 3.1

Suppose $(x,y)\sim \mathcal{D}$ where $\mathcal{D}$ is a Gaussian mixture model in $\mathbb{R}^d\times\{\pm 1\}$ with mean $\mu$ satisfying $\left\Vert{\mu}\right\Vert=\Theta(1)$, i.e., $y\sim\mathrm{Unif}(\{\pm 1\})$ and $x|y\sim \mathcal{N}(y\mu,I)$. Let $\ell(z)=\log(1+\exp(-z))$, and assume …
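
The data model and loss named in the lemma can be simulated directly. The following is a minimal sketch, not the authors' code: it draws samples from the two-component Gaussian mixture and evaluates the logistic loss $\ell(z)=\log(1+\exp(-z))$ for a linear classifier. The dimension, the sample size, the seed, and the helper names (`sample_gmm`, `mean_loss`) are illustrative assumptions.

```python
import math
import random


def logistic_loss(z: float) -> float:
    """l(z) = log(1 + exp(-z)), evaluated in a numerically stable way."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def sample_gmm(mu, n, rng):
    """Draw n pairs (x, y) with y ~ Unif({-1, +1}) and x | y ~ N(y * mu, I)."""
    samples = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        x = [y * m + rng.gauss(0.0, 1.0) for m in mu]
        samples.append((x, y))
    return samples


def mean_loss(beta, samples):
    """Average logistic loss of the linear classifier x -> sign(<beta, x>)."""
    total = 0.0
    for x, y in samples:
        margin = y * sum(b * xi for b, xi in zip(beta, x))
        total += logistic_loss(margin)
    return total / len(samples)


rng = random.Random(0)            # fixed seed, chosen only for repeatability
mu = [1.0, 0.0, 0.0]              # illustrative mean with norm 1, d = 3
data = sample_gmm(mu, 2000, rng)

loss_at_mu = mean_loss(mu, data)                  # classifier aligned with mu
loss_at_zero = mean_loss([0.0, 0.0, 0.0], data)   # uninformative classifier
```

In this toy setup the classifier aligned with $\mu$ attains a lower average loss than the zero classifier, whose loss is $\log 2$ on every sample.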

Figures (8)

  • Figure 1: Both self-training and CoT reasoning iteratively leverage model-generated information (pseudo-labels or reasoning processes) to gradually reduce the uncertainty of predictions.
  • Figure 2: Visualizations of entropy variations in the iterative process of self-training and CoT reasoning. In the self-training diagram, the iterative process represents the gradual convergence of the initial classifier $\beta_{\mathrm{init}}$ toward the Bayes optimal classifier $\mu$. At each iteration, changes in the angle between the classifier and the samples in different regions correspond to entropy variations within those samples. In the CoT reasoning diagram, each node in the directed acyclic graph represents a computation, with red nodes indicating erroneous computations. Each ellipse denotes a set of leaf nodes, corresponding to semantically equivalent answers, and the numbers indicate their respective iteration rounds. Bold arrows are used to represent complete reasoning paths. As the iterations proceed, these paths are gradually corrected. The semantic entropy at each iteration is determined by the distribution of generated answers across different semantic categories.
  • Figure 3: Accuracy under varying levels of semantic entropy on the AQuA and LastLetters datasets (based on 100 sampled instances).
  • Figure 4: The flowchart of the proposed CoT framework consists of two key modules, i.e., Task-Specific Prompt (light purple block) and Adaptive Reasoning Iteration (light blue block). Specifically, the task-specific prompt module first utilizes LLMs to generate $m$ candidate prompts and evaluates their semantic entropy on the given dataset. The prompt with the lowest entropy is selected as the optimal prompt $\hat{p}$, providing guidance for the subsequent adaptive reasoning iteration module to produce high-quality initial reasoning. In the adaptive reasoning iteration module, the uncertainty is calculated at each iteration and compared to a predefined threshold $\delta$. This evaluation determines whether to accept the current prediction as the final output or to proceed to another iteration. If the uncertainty remains high, a new reasoning round is initiated with a new prompt $p^{\ast}$, designed to introduce diversity compared to previous reasoning steps. This iterative process continues until the uncertainty is substantially reduced or the maximum number of iterations is reached.
  • Figure 5: Accuracy and time costs of adaptive reasoning iteration compared to the fixed reasoning iteration on the AQuA dataset.
  • ...and 3 more figures
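
The control flow described in the Figure 4 caption can be sketched as follows. This is a hedged illustration, not the paper's implementation: the sampler callback stands in for LLM calls, semantic equivalence is approximated by case-insensitive string equality, semantic entropy is taken as the Shannon entropy of the answer-class frequencies, and the final answer is a majority vote over the last round. All function names and the toy sampler are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(answers):
    """Shannon entropy of the empirical distribution over answer classes.

    Assumption: two answers are semantically equivalent when they match
    after stripping whitespace and lowercasing.
    """
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return sum((c / n) * math.log(n / c) for c in counts.values())


def select_prompt(candidates, sample_answers, questions):
    """Task-specific prompt step: keep the candidate with the lowest mean entropy."""
    def mean_entropy(prompt):
        values = [semantic_entropy(sample_answers(prompt, q)) for q in questions]
        return sum(values) / len(values)
    return min(candidates, key=mean_entropy)


def adaptive_reasoning(question, first_prompt, next_prompt, sample_answers,
                       delta, max_iters):
    """Adaptive iteration step: stop once entropy <= delta or rounds run out."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    prompt, history = first_prompt, []
    for _ in range(max_iters):
        answers = sample_answers(prompt, question)
        history.append(prompt)
        if semantic_entropy(answers) <= delta:
            break
        # Ask for a new prompt intended to differ from the earlier rounds.
        prompt = next_prompt(history)
    majority = Counter(answers).most_common(1)[0][0]
    return majority, len(history)


def toy_sampler(prompt, question):
    """Hypothetical stand-in for sampling several LLM answers."""
    if prompt == "diverse":
        return ["42", "42", "42", "42"]
    return ["42", "41", "40", "42"]


best = select_prompt(["step-by-step", "diverse"], toy_sampler, ["q1", "q2"])
answer, rounds = adaptive_reasoning(
    "q1", "step-by-step", lambda history: "diverse", toy_sampler,
    delta=0.5, max_iters=3,
)
```

With the toy sampler, the first round is too uncertain to accept, so a second round runs with the new prompt and the loop stops there. A real system would replace `toy_sampler` and the `next_prompt` callback with model calls.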

Theorems & Definitions (25)

  • Lemma 3.1
  • Theorem 3.2
  • Definition 3.3: Reasoning Structures
  • Definition 3.4: Semantic Entropy
  • Definition 3.5: Initial and Optimal Paths
  • Definition 3.6: Iterative CoT Reasoning
  • Definition 1.1: Gaussian Mixture Model
  • Lemma 1.2: Restatement of Lemma \ref{lemma:theta_t_change}
  • proof
  • Lemma 1.3: The Sample Complexity for Unlabeled Data, Theorem 3.6 in frei2022self
  • ...and 15 more