Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

Shuyang Jiang; Yusheng Liao; Ya Zhang; Yanfeng Wang; Yu Wang

TL;DR

The paper tackles the overthinking problem in large reasoning models trained with verifiable rewards by revealing a misalignment between trajectory-level length penalties and token-level optimization. It introduces DECS, a framework that decouples token-level rewards from sequence-level signals and employs curriculum data scheduling to balance efficiency with reasoning power. Empirical results across seven benchmarks show DECS reduces reasoning tokens by over 50% while maintaining or improving Pass@1 and AES scores, with demonstrated transfer to out-of-domain tasks. This work provides a practical method to compress reasoning without sacrificing underlying reasoning capabilities, benefiting real-world deployments of large language models.

Abstract

While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state of the art, their practical utility is hampered by "overthinking", a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance. These results demonstrate conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.
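The misalignment between trajectory-level rewards and token-level optimization can be illustrated with a toy sketch. This is not the DECS implementation: the function names, the coefficients, and the assumption that a per-token redundancy mask is already available are all hypothetical, chosen only to contrast a length penalty broadcast to every token with a reward that penalizes flagged tokens alone.

```python
from typing import List


def sequence_level_rewards(correct: bool, n_tokens: int,
                           length_coef: float = 0.01) -> List[float]:
    """Broadcast one trajectory-level reward to every token.

    The length penalty lowers the reward of all tokens equally,
    including exploratory tokens the answer may depend on.
    (length_coef is an illustrative value, not taken from the paper.)
    """
    reward = (1.0 if correct else 0.0) - length_coef * n_tokens
    return [reward] * n_tokens


def decoupled_token_rewards(correct: bool, redundant: List[bool],
                            penalty: float = 1.0) -> List[float]:
    """Assign rewards per token.

    Tokens not flagged as redundant keep the full correctness reward;
    only flagged tokens are penalized. How the mask is obtained is
    outside this sketch (hypothetical input).
    """
    base = 1.0 if correct else 0.0
    return [base - penalty if is_redundant else base
            for is_redundant in redundant]


# Toy trajectory: three useful tokens followed by two redundant ones.
mask = [False, False, False, True, True]
seq = sequence_level_rewards(True, len(mask))
tok = decoupled_token_rewards(True, mask)
```

In the broadcast variant every token receives the same reduced reward, so useful and redundant tokens are indistinguishable to the optimizer; in the decoupled variant only the last two tokens are penalized.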
