Entropy-Guided Reasoning Compression

Hourun Zhu, Yang Gao, Wenlong Fei, Jiawei Li, Huashan Sun

TL;DR

Reasoning models face an entropy conflict between compression and accuracy objectives; the paper analyzes it through an information-theoretic lens and observes it as opposing gradients on high-entropy tokens. The authors propose an entropy-guided framework with a compression stage that lowers entropy and an enhancement stage that raises it. The framework uses length clipping, absolute-advantage updates, and exponent reward shaping, followed by higher-temperature GRPO-based exploration and reward decomposition. Across six mathematical benchmarks and two model scales, the approach cuts reasoning length by roughly 80% while preserving or improving accuracy, and ablations confirm that both the stage ordering and the reward design are necessary. The work provides a principled approach to efficient, robust reasoning in LRMs and offers guidance for multi-objective training regimes in large-scale models.

Abstract

Large reasoning models have demonstrated remarkable performance on complex reasoning tasks, yet the excessive length of their chain-of-thought outputs remains a major practical bottleneck due to high computation cost and poor deployability. Existing compression methods have achieved partial success but overlook a crucial phenomenon in the training process -- the entropy conflict. During compression training, entropy decreases, leading to shorter reasoning but limited exploration, while accuracy-oriented objectives increase entropy, lengthening reasoning chains. This can cause the model to get stuck in a local dilemma. Our analysis further reveals the origin of the entropy conflict: many high-entropy tokens are logical connectors that receive larger gradients and are encouraged under the performance objective, while the compression objective simultaneously penalizes these potentially redundant connectors. This opposing pressure creates a direct source of entropy conflict. To address these issues, we adopt an entropy-guided training framework. As entropy descends, the model is guided toward efficient reasoning by encouraging concise thought steps; as entropy rises, exploration is reinforced under the compact reasoning mode to improve robustness. Experiments on six mathematical benchmarks show that our method compresses reasoning length to 20% of the original while maintaining or even surpassing baseline accuracy. Code and models will be released publicly.
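The quantity at the center of the abstract, token-level entropy, can be illustrated with a minimal sketch. It assumes the standard Shannon entropy of the softmax distribution over a token's logits; the function names `token_entropy` and `sequence_entropies` are hypothetical and are not taken from the paper's code.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position.

    Assumption: this is the usual definition H = -sum_i p_i * log(p_i);
    the paper's exact estimator may differ.
    """
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sequence_entropies(logits_per_step):
    """Entropy at each generated position, given one logit vector per position."""
    return [token_entropy(step) for step in logits_per_step]

# A uniform distribution is maximally uncertain; a peaked one is nearly certain.
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])
peaked = token_entropy([10.0, 0.0, 0.0, 0.0])
```

Under this definition, positions where the model is undecided among several continuations (for example, the logical connectors the abstract describes) score high, and positions with one dominant continuation score near zero.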

Paper Structure

This paper contains 23 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Entropy conflict in reasoning training. Compression objectives prefer shorter correct reasoning paths and lower entropy, whereas accuracy objectives accept all correct paths and raise entropy. This creates a fundamental tension between efficiency and correctness.
  • Figure 2: Overview of our method. The model first learns concise reasoning through compression. Then it enhances reasoning ability in the exploration phase via broader rollouts.
  • Figure 3: Overall empirical evidence of entropy conflict. (a) Distribution of samples selected by the compression objective (red) and accuracy objective (blue) during training. The detailed setup is provided in Section \ref{entropy_conflict_setup}. (b) Performance comparison on MATH under Entropy-Guided (EG) and Entropy-Entangled (EE) training. (c) Evaluation of EG, EE, and an extended EE* variant (trained for 400 additional steps) across four benchmarks (AIME24, Minerva, MATH, GSM8K). (d) Pearson correlation between token-level entropy and gradient magnitude during training. (e) Word-cloud visualization of some high-entropy tokens frequently encountered during training.
  • Figure 4: Comparison of the original model and our compressed model on the Math500 benchmark, grouped by four correctness transitions: preserved (✓→✓), lost (✓→×), gained (×→✓), and failed (×→×). The boxplots show the distribution of (a) token counts and (b) reasoning steps for both models within each group.
  • Figure 5: Case study on AIME24.