Mitigating Overthinking through Reasoning Shaping

Feifan Song, Shaohang Wei, Bofei Gao, Yejie Wang, Wen Luo, Wei Li, Linli Yao, Weimin Xiong, Liang Chen, Tianyu Liu, Houfeng Wang

TL;DR

Mitigating Overthinking through Reasoning Shaping introduces Group Relative Segment Penalization (GRSP), a segment-level penalty with length-aware weighting for RLVR in LRMs. By supervising reasoning at the trajectory segment level and distributing penalties across length clusters, GRSP reduces unnecessary thinking while preserving test-time scaling benefits. Empirical results show GRSP improves token efficiency with minimal accuracy loss, especially on harder problems, and enhances training stability across model sizes. The approach offers a principled way to balance reasoning compression and solution quality in large reasoning systems.

Abstract

Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often overthink: they produce excessive, meandering reasoning that inflates computational cost. Prior penalization designs in RLVR reduce token consumption but often harm model performance, which stems from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with advantages that are especially pronounced on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.

Paper Structure

This paper contains 22 sections, 9 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Illustration of redundancy detection. (a) Identifying redundant tokens is challenging, as most of them are weakly correlated with the golden answer; (b) Identifying redundant steps is much easier, because each step carries a clearer meaning.
  • Figure 2: The overall workflow of GRSP.
  • Figure 3: The ratio of segment counts across each cluster (correct vs. wrong). Longer segments are generally more prevalent in correct cases, and stronger models (a, b) exhibit flatter slopes compared to weaker ones (c, d).
  • Figure 4: Comparison of two weighting strategies. (a) Accuracy over training steps; (b) Average response length over training steps; (c) Average segment length over training steps.
  • Figure 5: Effect of GRSP across models of varying capacities, comparing changes in accuracy and average response length.
  • ...and 4 more figures