Reconsidering Overthinking: Penalizing Internal and External Redundancy in CoT Reasoning

Jialiang Hong, Taihang Zhen, Kai Chen, Jiaheng Liu, Junlan Feng, Wenpeng Zhu, Jing Huo, Yang Gao, Depeng Wang, Haitao Wan, Xi Yang, Boyan Wang, Fanyu Meng, Yuyao Zhang

TL;DR

This work addresses overthinking in large reasoning models by decomposing CoT redundancy into Internal Redundancy (within the First Correct Solution, FCS) and External Redundancy (content after the FCS). It introduces a dual-penalty reinforcement learning framework comprising a sliding-window Internal Redundancy Degree ($IRD$) and a normalized External Redundancy Degree ($ERD$) penalty to surgically shorten reasoning traces without harming accuracy. The authors define a minimal logical trajectory and quantify redundancy with $IRD$ and $ERD$, enabling a targeted, thresholded internal penalty $p_{int}$ and a linear external penalty $p_{ext}$. Empirical results across GSM8K, MATH500, AIME24, and out-of-domain benchmarks show that external redundancy can be largely eliminated with negligible performance loss, while internal redundancy requires careful balancing to avoid degrading reasoning fidelity. The result is more concise and interpretable CoT traces and improved efficiency. The approach generalizes across model scales and domains, offering a robust pathway to more efficient CoT compression.

Abstract

Large Reasoning Models (LRMs) often suffer from overthinking, generating verbose reasoning traces that compromise both computational efficiency and interpretability. Unlike prior efforts that rely on global length-based rewards, we propose a semantic-aware decomposition of redundancy into two distinct forms: internal redundancy (informational stagnation within the reasoning process) and external redundancy (superfluous continuation after the final answer). We introduce a dual-penalty reinforcement learning framework that surgically targets these inefficiencies: a sliding-window semantic analysis is employed to penalize low-gain steps within the reasoning trajectory, while a normalized metric suppresses the post-answer tail. Extensive experiments demonstrate that our method significantly compresses Chain-of-Thought traces with minimal accuracy degradation, while maintaining strong generalization to out-of-domain tasks. Crucially, we reveal an asymmetry in redundancy: external redundancy can be safely eliminated without performance loss, whereas internal redundancy removal requires a calibrated trade-off to maintain reasoning fidelity. Our framework enables fine-grained, implicit control over reasoning length, paving the way for more concise and interpretable LRMs.
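To make the two redundancy notions concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes whitespace tokenization, a bag-of-words cosine similarity as a stand-in for the semantic analysis, and simple illustrative forms for the penalties (a hinge at a threshold for $p_{int}$, the identity for $p_{ext}$). All function names, the window and stride sizes, and the threshold value are hypothetical; the exact definitions are given by the paper's equations, which this summary does not reproduce.

```python
import math
from collections import Counter


def split_at_first_answer(trace: str, answer: str) -> tuple[str, str]:
    """Split a trace into (FCS, tail) at the first occurrence of the answer.

    Assumption: the first literal occurrence of `answer` marks the end of the
    First Correct Solution. A plain substring search can match inside a longer
    number, so a real pipeline would need a stricter matcher.
    """
    idx = trace.find(answer)
    if idx == -1:
        return trace, ""
    cut = idx + len(answer)
    return trace[:cut], trace[cut:]


def external_redundancy_degree(trace: str, answer: str) -> float:
    """Assumed ERD: fraction of whitespace tokens that come after the FCS."""
    _, tail = split_at_first_answer(trace, answer)
    total = len(trace.split())
    return len(tail.split()) / total if total else 0.0


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def internal_redundancy_degree(fcs: str, window: int = 8, stride: int = 4) -> float:
    """Assumed IRD: mean similarity between adjacent sliding windows of the FCS."""
    tokens = fcs.lower().split()
    last_start = max(len(tokens) - window + 1, 1)
    windows = [Counter(tokens[i:i + window]) for i in range(0, last_start, stride)]
    if len(windows) < 2:
        return 0.0
    sims = [_cosine(windows[i], windows[i + 1]) for i in range(len(windows) - 1)]
    return sum(sims) / len(sims)


def internal_penalty(ird: float, threshold: float = 0.5) -> float:
    """Assumed thresholded penalty: zero below the threshold, linear above it."""
    return max(0.0, ird - threshold)


def external_penalty(erd: float) -> float:
    """Assumed linear penalty: proportional to ERD with unit slope."""
    return erd


if __name__ == "__main__":
    trace = "we compute the sum and get 204 . wait , let me double check this again"
    fcs, _ = split_at_first_answer(trace, "204")
    print("ERD:", external_redundancy_degree(trace, "204"))
    print("IRD:", internal_redundancy_degree(fcs, window=4, stride=2))
```

In this sketch the external penalty is active whenever any text follows the first correct answer, while the internal penalty only activates once adjacent windows become similar enough. This mirrors the asymmetry described above, in which the post-answer tail can be removed aggressively but in-solution compression needs a calibrated threshold.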

Paper Structure

This paper contains 38 sections, 8 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Response examples for AIME24. The underlined answer (e.g., "204") acts as a delimiter: content before it is the First Correct Solution (FCS), which is subject to internal redundancy; content after it is external redundancy. Compared to R1 and o1, our method generates a more efficient FCS (full content in the appendix) while eliminating superfluous post-answer text.
  • Figure 2: The Dual-redundancy Reward framework iteratively optimizes the LLM via complementary redundancy detection.
  • Figure 3: Analysis of ERD. Since human-written solutions inherently contain no external redundancy, we only report the ERD performance for the LRMs.
  • Figure 4: IRD Analysis on different datasets. The local similarity of DeepSeek-R1 is significantly higher than that of human answers and GPT-4o.
  • Figure 5: Performance on GPQA and LiveCodeBench. Our method generalizes well to out-of-domain reasoning tasks.
  • ...and 7 more figures