Reconsidering Overthinking: Penalizing Internal and External Redundancy in CoT Reasoning
Jialiang Hong, Taihang Zhen, Kai Chen, Jiaheng Liu, Junlan Feng, Wenpeng Zhu, Jing Huo, Yang Gao, Depeng Wang, Haitao Wan, Xi Yang, Boyan Wang, Fanyu Meng, Yuyao Zhang
TL;DR
This work addresses overthinking in large reasoning models by decomposing CoT redundancy into internal redundancy (within the First Correct Solution, FCS) and external redundancy (content generated after the FCS). It introduces a dual-penalty reinforcement learning framework that shortens reasoning traces with minimal accuracy loss: a sliding-window Internal Redundancy Degree (IRD) targets low-gain steps inside the FCS, and a normalized External Redundancy Degree (ERD) targets the post-FCS tail. The authors define a minimal logical trajectory and quantify redundancy with IRD and ERD, which drive a thresholded internal penalty $p_{int}$ and a linear external penalty $p_{ext}$. Empirical results on GSM8K, MATH500, AIME24, and out-of-domain benchmarks show that external redundancy can be largely eliminated with negligible performance loss, whereas internal redundancy must be reduced with care to avoid degrading reasoning fidelity. The resulting CoT traces are more concise and interpretable, and the approach generalizes across model scales and domains.
Abstract
Large Reasoning Models (LRMs) often suffer from overthinking, generating verbose reasoning traces that compromise both computational efficiency and interpretability. Unlike prior efforts that rely on global length-based rewards, we propose a semantic-aware decomposition of redundancy into two distinct forms: internal redundancy (informational stagnation within the reasoning process) and external redundancy (superfluous continuation after the final answer). We introduce a dual-penalty reinforcement learning framework that surgically targets these inefficiencies: a sliding-window semantic analysis is employed to penalize low-gain steps within the reasoning trajectory, while a normalized metric suppresses the post-answer tail. Extensive experiments demonstrate that our method significantly compresses Chain-of-Thought traces with minimal accuracy degradation, while maintaining strong generalization to out-of-domain tasks. Crucially, we reveal an asymmetry in redundancy: external redundancy can be safely eliminated without performance loss, whereas internal redundancy removal requires a calibrated trade-off to maintain reasoning fidelity. Our framework enables fine-grained, implicit control over reasoning length, paving the way for more concise and interpretable LRMs.
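
The abstract does not give the formulas behind the two penalties, so the following is only a minimal sketch of how such a dual penalty could be wired together, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of cosine similarity between averaged step embeddings as the sliding-window redundancy signal, the window size, the threshold value, and the choice to express both penalties as multiplicative factors in [0, 1].

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def internal_redundancy_degree(steps: List[Vector], window: int = 2) -> float:
    """Hypothetical IRD: mean similarity between consecutive sliding windows.

    `steps` holds one embedding per reasoning step inside the first correct
    solution. Each window is summarized by the mean of its step embeddings.
    """
    if window < 1 or len(steps) < window + 1:
        return 0.0  # too short to compare two windows
    dim = len(steps[0])
    means = [
        [sum(s[d] for s in steps[i:i + window]) / window for d in range(dim)]
        for i in range(len(steps) - window + 1)
    ]
    sims = [cosine(means[i], means[i + 1]) for i in range(len(means) - 1)]
    return min(1.0, max(0.0, sum(sims) / len(sims)))


def external_redundancy_degree(total_len: int, fcs_end: int) -> float:
    """Hypothetical ERD: fraction of the trace that follows the FCS."""
    if total_len <= 0:
        return 0.0
    return min(1.0, max(0, total_len - fcs_end) / total_len)


def internal_penalty(ird: float, threshold: float = 0.5) -> float:
    """Thresholded factor: no penalty at or below `threshold` (assumed value)."""
    if ird <= threshold:
        return 1.0
    return 1.0 - (ird - threshold) / (1.0 - threshold)


def external_penalty(erd: float) -> float:
    """Linear factor: decreases in direct proportion to ERD."""
    return 1.0 - erd


def shaped_reward(correct: bool, ird: float, erd: float) -> float:
    """Assumed combination: scale the correctness reward by both factors."""
    base = 1.0 if correct else 0.0
    return base * internal_penalty(ird) * external_penalty(erd)
```

Under these assumptions, a correct trace whose last 20% of tokens follow the first correct solution and whose internal steps are not repetitive would have its reward scaled only by the external factor, which mirrors the asymmetry described above: the external term can be applied directly, while the internal term acts only once redundancy exceeds a threshold.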
