DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains

Tian Liang, Wenxiang Jiao, Zhiwei He, Jiahao Xu, Haitao Mi, Dong Yu

TL;DR

DeepCompress introduces a dual length reward and a model-aware difficulty mechanism to dynamically adjust Chain-of-Thought length during training of Large Reasoning Models. By classifying questions as Simple or Hard in real time using batch and group pass ratios, it encourages shorter reasoning for Simple questions and longer exploration for Hard ones, balancing accuracy and token efficiency. Across challenging math benchmarks, DeepCompress achieves state-of-the-art results with substantially reduced token usage and fosters higher policy entropy, indicating more effective exploration. The approach lets models allocate reasoning effort adaptively, improving both performance and efficiency in complex problem solving.
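
The difficulty classification described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `classify_questions`, the definition of the group pass ratio as the fraction of correct sampled responses for one question, the definition of the batch pass ratio as the fraction of correct responses across the whole batch, and the rule "Simple when the group ratio exceeds the batch ratio" are all assumptions made for the example.

```python
from statistics import mean


def classify_questions(correct_by_question):
    """Label each question Simple or Hard relative to the current batch.

    correct_by_question maps a question id to a list of 0/1 outcomes,
    one per sampled response (hypothetical input format).
    """
    # Group pass ratio: share of correct responses for a single question.
    group_ratio = {q: mean(outcomes) for q, outcomes in correct_by_question.items()}
    # Batch pass ratio: share of correct responses over the whole batch.
    batch_ratio = mean(x for outcomes in correct_by_question.values() for x in outcomes)

    labels = {}
    for q, ratio in group_ratio.items():
        # Assumed convention: beta > 0 means the model solves this question
        # more often than the batch average, so it is treated as Simple.
        beta = ratio - batch_ratio
        labels[q] = ("Simple" if beta > 0 else "Hard", beta)
    return labels


batch = {"q1": [1, 1, 1, 0], "q2": [0, 0, 1, 0]}
labels = classify_questions(batch)
```

Because both ratios are recomputed on every training batch, the same question could move from Hard to Simple as the policy improves, which is the sense in which the classification is model-aware.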

Abstract

Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like "overthinking" simple problems and "underthinking" complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as "Simple" or "Hard" in real time based on the model's evolving capability. It encourages shorter, more efficient reasoning for "Simple" problems while promoting longer, more exploratory thought chains for "Hard" problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.

Paper Structure

This paper contains 32 sections, 11 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Relationship between standardized response length (z) and mathematical reasoning performance (pass@k). Pass@1 score decreases with increasing length, while Pass@32 generally increases.
  • Figure 2: Reward values for our DeepCompress method. Subfigure (a) illustrates the reward for Simple Questions, and (b) for Hard Questions. For both, Blue indicates correct responses and Red indicates incorrect responses. The dashed line denotes the baseline outcome reward ($R_o$), while the solid line represents our final combined reward ($R = R_o + R_l$), effectively showcasing how our Dual Length Reward ($R_l$) dynamically modulates the reward signal based on standardized response length ($z$) and question difficulty ($\beta$).
  • Figure 3: Average Response Length across mathematical benchmarks. DeepCompress-Zero models achieve significantly shorter average outputs compared to DeepMath-Zero models.
  • Figure 4: Training dynamics and evaluation results of DeepCompress. (a) Policy entropy during training. (b) Average response length on training batches. (c) Average pass@1 score (%) on test sets. (d) Average response length on test sets.
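
The caption of Figure 2 describes a combined reward $R = R_o + R_l$ in which the length term depends on the standardized response length $z$ and the question difficulty $\beta$. The sketch below shows one way such a reward could be wired together. The exact form of $R_l$ is not given in this summary, so the bounded `tanh` shape, the scale `alpha`, the sign convention (positive `beta` for Simple questions), and the helper names are assumptions chosen only to reproduce the qualitative behavior: shorter responses are favored on Simple questions and longer ones on Hard questions.

```python
import math
from statistics import mean, pstdev


def standardize_lengths(lengths):
    """Return z-scores of response lengths within one group of samples."""
    mu = mean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        # All responses have the same length, so no length signal is available.
        return [0.0] * len(lengths)
    return [(n - mu) / sigma for n in lengths]


def combined_reward(outcome_reward, z, beta, alpha=0.2):
    """Illustrative R = R_o + R_l with an assumed length term.

    beta > 0 (assumed Simple): below-average length (z < 0) earns a bonus.
    beta < 0 (assumed Hard): above-average length (z > 0) earns a bonus.
    """
    length_reward = alpha * math.tanh(-beta * z)
    return outcome_reward + length_reward


z = standardize_lengths([100, 200, 300])
simple_short = combined_reward(1.0, z[0], beta=0.5)
simple_long = combined_reward(1.0, z[2], beta=0.5)
hard_short = combined_reward(1.0, z[0], beta=-0.5)
hard_long = combined_reward(1.0, z[2], beta=-0.5)
```

Keeping `alpha` well below the gap between correct and incorrect outcome rewards is a deliberate choice in this sketch: it means the length term can reorder responses of equal correctness but cannot make an incorrect response outscore a correct one.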