Think When You Need: Self-Adaptive Chain-of-Thought Learning
Junjie Yang, Ke Lin, Xing Yu
TL;DR
Think When You Need presents a self-adaptive Chain-of-Thought learning framework that replaces direct length penalties with a robust pairwise-reward strategy. The method computes each sample's reward as $r(m_i) = \sum_{k \neq i} r_{ik}(m_i)$, summing pairwise comparison rewards against every other sample, with $r_{ik}$ defined per pairwise scenario. This encourages conciseness without sacrificing correctness and extends to fuzzy tasks where ground truth is unavailable. Empirical results on verifiable benchmarks (e.g., DeepScaleR, DAPO) and fuzzy tasks (AlpacaFarm) show substantial CoT length reductions while maintaining or improving accuracy, with larger models deriving greater efficiency benefits. The approach is theoretically grounded, integrates with existing reward structures, and has practical implications for scalable, efficient reasoning in language models.
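To make the aggregation $r(m_i) = \sum_{k \neq i} r_{ik}(m_i)$ concrete, here is a minimal Python sketch. Only the summation over $k \neq i$ comes from the text above. The `Sample` class, the function names, and in particular the scoring rule in `toy_pairwise_reward` (1.0 for beating an incorrect sample, 0.5 for being the shorter of two correct samples) are hypothetical placeholders chosen for illustration and should not be read as the paper's actual pairwise scenarios.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    correct: bool  # whether the final answer was verified as correct
    length: int    # chain-of-thought length, e.g. in tokens


def toy_pairwise_reward(m_i: Sample, m_k: Sample) -> float:
    """Hypothetical r_ik(m_i): an illustrative stand-in, NOT the paper's rule."""
    if m_i.correct and not m_k.correct:
        return 1.0  # assumed bonus: correct sample compared with an incorrect one
    if m_i.correct and m_k.correct and m_i.length < m_k.length:
        return 0.5  # assumed bonus: shorter of two correct samples
    return 0.0


def sample_rewards(
    samples: List[Sample],
    pairwise: Callable[[Sample, Sample], float],
) -> List[float]:
    """Compute r(m_i) = sum over k != i of r_ik(m_i) for every sample."""
    return [
        sum(pairwise(m_i, m_k) for k, m_k in enumerate(samples) if k != i)
        for i, m_i in enumerate(samples)
    ]


# Three sampled responses to one prompt: two correct, one incorrect.
group = [Sample(True, 120), Sample(True, 300), Sample(False, 80)]
rewards = sample_rewards(group, toy_pairwise_reward)
print(rewards)  # [1.5, 1.0, 0.0]
```

Under this toy rule the short correct response scores highest, the long correct response still outranks the incorrect one, and no absolute length penalty appears anywhere: length matters only relative to other samples for the same prompt.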
Abstract
Chain of Thought (CoT) reasoning enhances language models' performance but often leads to inefficient "overthinking" on simple problems. We observe that existing approaches that directly penalize reasoning length fail to account for varying problem complexity. Our approach constructs rewards through length and quality comparisons, guided by theoretical assumptions that jointly enhance solution correctness and conciseness. We further extend our method to fuzzy tasks where ground truth is unavailable. Experiments across multiple reasoning benchmarks demonstrate that our method maintains accuracy while generating significantly more concise explanations, effectively teaching models to "think when needed."
