Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning

Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li, Hejie Cui, Sarah Zhang, Chia-Yuan Chang, Kewei Cheng, Besnik Fetahu, Priyanka Nigam, Jingbo Shang, Bing Yin

TL;DR

This work proposes Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across reasoning steps according to each step's intrinsic contribution, penalizing low-importance steps more heavily while preserving high-importance reasoning.

Abstract

Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
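
The redistribution idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names `excess_length` and `redistribute_penalty`, the `temperature` parameter, and the softmax-over-negated-gains weighting are hypothetical and are not taken from the paper. The sketch only shows the general shape: a fixed penalty mass (the excess length of a correct response) is split across steps so that low-importance steps absorb a larger share.

```python
import math


def excess_length(response_len, reference_len, is_correct):
    """Penalty mass: tokens beyond the reference length, for correct responses only.

    Assumption: incorrect responses receive no length penalty in this sketch.
    """
    if not is_correct:
        return 0.0
    return float(max(0, response_len - reference_len))


def redistribute_penalty(step_gains, penalty_mass, temperature=1.0):
    """Split penalty_mass across steps; lower-gain steps get larger shares.

    Hypothetical rule: shares follow a softmax over negated gains, so the
    shares are positive and always sum to penalty_mass.
    """
    if penalty_mass <= 0 or not step_gains:
        return [0.0] * len(step_gains)
    logits = [-g / temperature for g in step_gains]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [penalty_mass * e / total for e in exps]


# Four steps with illustrative importance scores (made-up values).
gains = [0.6, 0.0, 0.05, 0.3]
mass = excess_length(response_len=900, reference_len=800, is_correct=True)
penalties = redistribute_penalty(gains, mass)
```

In this toy setup the step with zero gain receives the largest share of the 100-token penalty mass and the step with gain 0.6 the smallest, which mirrors the stated goal of compressing redundant steps while protecting important ones.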

Paper Structure (31 sections, 10 equations, 6 figures, 2 tables, 1 algorithm)

Figures (6)

  • Figure 1: Step-level redundancy in reasoning. We segment the model's responses into steps by token length and use the change in the probability of generating the ground truth answer to measure each step's progress. (a) Distribution of step-wise information gain: most steps contribute little or no progress toward the correct answer, and high-gain steps are rare (4% with gain $> 0.5$). (b) Longer reasoning trajectories contain a lower fraction of high-gain steps. (c) In SwAP, in addition to the outcome reward $r_o$, each step is also assigned a progress reward $r_s$. When the response is correct but exceeds the reference length, the penalty (dashed grey area) is distributed across steps based on each step's progress (darker shades represent more progress).
  • Figure 2: Comparison of Pass@1 accuracy vs. token usage across math benchmarks. SwAP establishes the Pareto frontier for best performance under every token budget.
  • Figure 3: Impact of step advantage weight $\theta$ on model performance and response length. A moderate value of $\theta \in [0.2, 0.4]$ achieves the best balance. Hard datasets such as AIME 2024/2025 are more sensitive to a large $\theta$ value.
  • Figure 4: Training dynamics of different model ablations. SwAP achieves the best validation performance, with a final average response length similar to Outcome-Only (length penalty assigned at the trajectory level).
  • Figure 5: Comparison of Pass@1 accuracy vs. step length across four math benchmarks.
  • ...and 1 more figure
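
The step-level measurement described in the Figure 1 caption (segment by token length, then track the change in the probability of the ground truth answer) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names, the fixed `step_size`, and the example probabilities are hypothetical, and the sketch assumes the answer probability after each step prefix has already been computed by the model.

```python
def segment_by_length(tokens, step_size):
    """Split a token sequence into consecutive steps of at most step_size tokens."""
    return [tokens[i:i + step_size] for i in range(0, len(tokens), step_size)]


def step_gains(answer_probs):
    """Per-step gain: change in answer probability across each step.

    answer_probs[0] is the probability before any reasoning step, and
    answer_probs[k] is the probability after step k, so n steps need n + 1 values.
    """
    return [after - before for before, after in zip(answer_probs, answer_probs[1:])]


def high_gain_fraction(gains, threshold=0.5):
    """Fraction of steps whose gain exceeds the threshold (0.5 as in Figure 1a)."""
    if not gains:
        return 0.0
    return sum(1 for g in gains if g > threshold) / len(gains)


# Made-up example: 5 tokens cut into steps of 2, and 3 steps of answer probabilities.
steps = segment_by_length(list(range(5)), step_size=2)
gains = step_gains([0.1, 0.1, 0.7, 0.75])
fraction = high_gain_fraction(gains)
```

Here only the second step moves the answer probability substantially, so one of three steps counts as high-gain; the caption reports that on real trajectories such steps are far rarer.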