Think When You Need: Self-Adaptive Chain-of-Thought Learning

Junjie Yang, Ke Lin, Xing Yu

TL;DR

Think When You Need presents a self-adaptive Chain-of-Thought learning framework that replaces direct length penalties with a robust pairwise-reward strategy. The method computes sample-level rewards via $r(m_i) = \sum_{k \neq i} r_{ik}(m_i)$ across defined pairwise scenarios, enabling conciseness without sacrificing correctness and extending to fuzzy tasks where ground truth is unavailable. Empirical results on verifiable benchmarks (e.g., DeepScaleR, DAPO) and fuzzy tasks (AlpacaFarm) show substantial CoT length reductions while maintaining or improving accuracy, with larger models deriving greater efficiency benefits. The approach is theoretically grounded, integrates with existing reward structures, and offers practical implications for scalable, efficient reasoning in language models.
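The sample-level sum $r(m_i) = \sum_{k \neq i} r_{ik}(m_i)$ can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `sample_rewards` and `toy_pairwise` are hypothetical, and the pairwise rule used here (compare correctness only) is a placeholder standing in for the paper's pairwise scenarios, which are not spelled out in this summary.

```python
from typing import Callable, Sequence, Tuple

# A sample is represented here as (is_correct, length_in_tokens).
# This representation is an assumption made for illustration only.
Sample = Tuple[bool, int]


def sample_rewards(
    samples: Sequence[Sample],
    pairwise_reward: Callable[[Sample, Sample], float],
) -> list:
    """Compute r(m_i) as the sum of r_ik(m_i) over all k != i."""
    return [
        sum(pairwise_reward(m_i, m_k) for k, m_k in enumerate(samples) if k != i)
        for i, m_i in enumerate(samples)
    ]


def toy_pairwise(a: Sample, b: Sample) -> float:
    """Placeholder rule (NOT the paper's): +1 if only `a` is correct,
    -1 if only `b` is correct, 0 otherwise."""
    if a[0] and not b[0]:
        return 1.0
    if b[0] and not a[0]:
        return -1.0
    return 0.0


samples = [(True, 120), (False, 300), (True, 80)]
rewards = sample_rewards(samples, toy_pairwise)
print(rewards)
```

Because the aggregation only depends on the pairwise function passed in, a length-aware rule can be swapped in without changing `sample_rewards`.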

Abstract

Chain of Thought (CoT) reasoning enhances language models' performance but often leads to inefficient "overthinking" on simple problems. We observe that existing approaches that directly penalize reasoning length fail to account for varying problem complexity. Our approach constructs rewards through length and quality comparisons, guided by theoretical assumptions that jointly enhance solution correctness and conciseness. We further extend our method to fuzzy tasks where ground truth is unavailable. Experiments across multiple reasoning benchmarks demonstrate that our method maintains accuracy while generating significantly more concise explanations, effectively teaching models to "think when needed."

Paper Structure

This paper contains 20 sections, 15 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Illustration of the proposed algorithm and verifiable task comparison. The approach involves sampling $N$ candidate answers and performing comprehensive pairwise comparisons between them. The final reward for each candidate is computed as the summation of all pairwise rewards obtained when compared against the other answers.
  • Figure 2: Fuzzy task comparison scenarios. When a shorter, better answer is compared with a longer, worse one, the shorter, better answer receives a reward of $\alpha+\beta$ while the longer, worse one receives $-\alpha-\beta$. When a longer, better answer is compared with a shorter, worse one, the longer, better answer receives a reward of $\alpha$ while the shorter, worse one receives $-\alpha$.
  • Figure 3: Performance comparison in the DeepScaleR setting. Our method achieves comparable test accuracy to the baseline while significantly reducing response length during both training and testing phases. Solid lines represent results for the 1.5B model, while dashed lines represent the 7B model.
  • Figure 4: Performance comparison in the 8K DAPO setting. Our method maintains or slightly improves test accuracy compared to the baseline while substantially reducing response length. Solid lines represent results for the 1.5B model, while dashed lines represent the 7B model.
  • Figure 5: Performance comparison in the 16K DAPO setting across different benchmarks. Our method maintains comparable test accuracy while substantially reducing response length across both AIME 2024 and MATH 500 benchmarks. Solid lines represent the 1.5B model, while dashed lines represent the 7B model.
  • ...and 5 more figures
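
The fuzzy-task pairwise rule described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `fuzzy_pair_reward` and the default values of `alpha` and `beta` are hypothetical, and the caption does not say what happens when the two answers have equal length, so the code assumes no brevity bonus in that case.

```python
from typing import Tuple


def fuzzy_pair_reward(
    len_better: int,
    len_worse: int,
    alpha: float = 1.0,  # hypothetical default; quality term
    beta: float = 0.5,   # hypothetical default; brevity bonus
) -> Tuple[float, float]:
    """Return (reward for the better answer, reward for the worse answer).

    Per the Figure 2 caption:
      - better answer is shorter: +(alpha + beta) / -(alpha + beta)
      - better answer is longer:  +alpha / -alpha
    Equal lengths are not covered by the caption; this sketch assumes
    no brevity bonus in that case.
    """
    bonus = beta if len_better < len_worse else 0.0
    reward = alpha + bonus
    return reward, -reward


# Shorter answer judged better: the brevity bonus applies.
short_better = fuzzy_pair_reward(len_better=50, len_worse=200)
# Longer answer judged better: only the quality term applies.
long_better = fuzzy_pair_reward(len_better=200, len_worse=50)
print(short_better, long_better)
```

Which answer counts as "better" would come from an external preference signal such as a judge or reward model; that component is outside this sketch.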