Rethinking Chain-of-Thought from the Perspective of Self-Training
Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng
TL;DR
The work identifies a core parallel between CoT reasoning and self-training: both iteratively leverage model-generated information to reduce prediction uncertainty. It introduces a two-module CoT framework: a Task-Specific Prompt (TSP) module that generates high-quality initial reasoning, and an Adaptive Reasoning Iteration (ARI) module that refines the reasoning while preventing over-reasoning and encouraging diversity between iterations. Through a theoretical analysis of entropy dynamics and experiments across ten datasets, the method shows substantial gains over zero-shot and self-consistency baselines, most notably on arithmetic tasks, along with better computational efficiency. The approach offers practical guidance for controlling reasoning quality and exploration in LLMs, which improves the reliability and applicability of CoT in real-world tasks.
Abstract
Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in large language models (LLMs). We observe that CoT reasoning and self-training share a core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes the initial reasoning process, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses two limitations of previous CoT approaches, i.e., over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments demonstrate that the proposed method achieves significant advantages in both performance and computational efficiency.
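
To make the notion of "reducing prediction uncertainty" concrete, the following is a minimal sketch, not the paper's implementation. It assumes uncertainty is measured as the Shannon entropy of the empirical distribution over answers sampled in each reasoning round, and that iteration stops once that entropy falls below a threshold. The names `answer_entropy`, `iterate_until_confident`, `sample_answers`, `max_rounds`, and `threshold` are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def iterate_until_confident(sample_answers, max_rounds=5, threshold=0.5):
    """Run reasoning rounds until answer entropy is at or below `threshold`.

    `sample_answers(round_idx)` is a hypothetical callable that returns a
    non-empty list of final answers sampled from the model in that round.
    Returns the majority answer of the last round and the entropy history.
    """
    history = []
    answers = []
    for round_idx in range(max_rounds):
        answers = sample_answers(round_idx)
        entropy = answer_entropy(answers)
        history.append(entropy)
        if entropy <= threshold:
            # Stopping early here is one simple guard against over-reasoning.
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, history
```

In this sketch, a model call would replace `sample_answers`; the early stop stands in for the idea of halting refinement once predictions are sufficiently certain, and it does not model the diversity mechanism described in the abstract.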
