Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang

TL;DR

The paper tackles the overthinking problem in large reasoning models trained with verifiable rewards by revealing a misalignment between trajectory-level length penalties and token-level optimization. It introduces DECS, a framework that decouples token-level rewards from sequence-level signals and employs curriculum data scheduling to balance efficiency with reasoning power. Empirical results across seven benchmarks show DECS reduces reasoning tokens by over 50% while maintaining or improving Pass@1 and AES scores, with demonstrated transfer to out-of-domain tasks. This work provides a practical method to compress reasoning without sacrificing underlying reasoning capabilities, benefiting real-world deployments of large language models.

Abstract

While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.

Paper Structure

This paper contains 51 sections, 4 theorems, 31 equations, 13 figures, 7 tables.

Key Result

Lemma 1

Let the actor policy $\pi_\theta$ be a tabular softmax policy updated using the policy-gradient rule (Eq. policy_gradient) with a learning rate $\eta$. Then the difference of $z_{{\bm{o}}_{<t},o_t}$ between two consecutive steps $m$ and $m+1$ satisfies

Figures (13)

  • Figure 1: Left: Two major flaws of the prior practice of applying a sequence-level length reward without controlling the training data. Negative advantage values penalize correct high-entropy tokens in long sequences, while positive ones reward redundant tokens in short sequences; Middle: These flaws of length rewards lead to inferior performance and suboptimal efficiency gains on the AIME2024 dataset; Right: DeCS improves the pass@1 of base models while reducing token costs by $\sim 60\%$ relative to the base model across 7 benchmarks. Experimental details are presented in the paper's appendix.
  • Figure 2: Overview of the DeCS training pipeline. (1) Decoupled Token-level Reward: We fine-tune a small language model to separate the necessary reasoning prefix (NRP) from the redundant remainder. The two parts are rewarded separately, so that overthinking is penalized consistently while the probability of generating necessary reasoning steps is maintained. (2) Curriculum Prompt Schedule: The number of easy prompts $q_{\theta,G}$ grows in step with the progressive decline in remaining redundancy.
  • Figure 3: (a) Ablation study with two major components of DeCS on the DS-1.5B base model. (b) Comparison of DeCS with other baselines on the proportion of NRP tokens (PRNP) and actual reasoning tokens in the AIME2024 testbed. (c) DeCS performs on par with the base policy (DS-1.5B) in terms of Pass@K scores on three challenging benchmarks.
  • Figure 4: (a) Average tokens and Pass@1 performance with 5 increasing generation budgets; (b) Frequency of reasoning behavior tokens after applying DeCS; (c) Consistent compression rates of DeCS on six difficulty levels sourced from MATH500 and AIME2024.
  • Figure 5: Prompt used for training and inference with the NRP detector.
  • ...and 8 more figures
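
The two components named in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the reward values (+1 for NRP tokens, -1 for redundant tokens, 0 for incorrect responses), and the linear schedule are all hypothetical, and the NRP boundary index `nrp_end` is assumed to be supplied by an external detector.

```python
from typing import List


def decoupled_token_rewards(
    num_tokens: int,
    nrp_end: int,
    is_correct: bool,
    nrp_reward: float = 1.0,          # hypothetical value
    redundancy_penalty: float = -1.0, # hypothetical value
    incorrect_reward: float = 0.0,    # hypothetical value
) -> List[float]:
    """Return one reward per token, split at the NRP boundary.

    Tokens at positions [0, nrp_end) are treated as the necessary
    reasoning prefix; tokens from nrp_end onward are treated as redundant.
    """
    if not 0 <= nrp_end <= num_tokens:
        raise ValueError("nrp_end must lie in [0, num_tokens]")
    if not is_correct:
        # Assumption: an incorrect response gets a uniform reward.
        return [incorrect_reward] * num_tokens
    return [nrp_reward] * nrp_end + [redundancy_penalty] * (num_tokens - nrp_end)


def num_easy_prompts(
    batch_size: int,
    redundancy_ratio: float,
    max_easy_fraction: float = 0.5,   # hypothetical cap
) -> int:
    """Admit more easy prompts as the remaining redundancy declines.

    redundancy_ratio is the fraction of generated tokens that fall
    outside the NRP (1.0 = all redundant, 0.0 = none). The linear form
    is an assumption chosen for simplicity.
    """
    if not 0.0 <= redundancy_ratio <= 1.0:
        raise ValueError("redundancy_ratio must lie in [0, 1]")
    return int(batch_size * max_easy_fraction * (1.0 - redundancy_ratio))
```

For a correct 5-token response whose NRP ends at index 3, `decoupled_token_rewards(5, 3, True)` gives the first three tokens the NRP reward and the last two the penalty, so the necessary steps are never penalized just because the full sequence is long. With a batch of 64 prompts, `num_easy_prompts` admits no easy prompts while all tokens are redundant and up to 32 once no redundancy remains.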

Theorems & Definitions (8)

  • Lemma 1: Difference of policy logits in vanilla policy gradient
  • Lemma 2: Decreased logits for correct high-entropy tokens
  • Theorem 1: Maintenance of High-entropy Tokens Under Batch Learning
  • Definition 1: Necessary Reasoning Prefix
  • Theorem 2: Suboptimal Reduction of Redundant Tokens
  • proof
  • proof
  • proof
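
As background for Lemmas 1 and 2, the sketch below shows how a single vanilla policy-gradient step moves the logits of a tabular softmax policy at one state. It relies only on the standard identity $\partial \log \pi(a) / \partial z_k = \mathbb{1}[k=a] - \pi(k)$. It is not the paper's exact statement, which may involve expectations or batch terms, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_update(
    logits: List[float], action: int, advantage: float, lr: float
) -> List[float]:
    """One single-sample policy-gradient step on the logits of one state.

    Each logit z_k moves by lr * advantage * (1[k == action] - pi(k)).
    """
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]
```

With a negative advantage the sampled token's logit decreases, and the step is scaled by $1 - \pi(a)$, so a token sampled with low probability is moved more than a near-deterministic one. This is the mechanism by which a sequence-level length penalty can push down correct tokens that happen to sit in a long response.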