Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy

Zeju Li, Jianyuan Zhong, Ziyang Zheng, Xiangyu Wen, Zhijian Xu, Yingying Cheng, Fan Zhang, Qiang Xu

TL;DR

This work tackles the inefficiency of verbose Chain-of-Thought reasoning in large language models by introducing step entropy, a metric that quantifies the informational contribution of individual reasoning steps. By showing that many low-entropy steps are redundant, the authors demonstrate that up to 80% of such steps can be pruned with minimal impact on accuracy, yielding substantial token reductions across multiple model families and benchmarks. They further enable autonomous compressed CoT generation through a two-stage training pipeline combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), guiding models to insert [SKIP] tokens strategically. The approach delivers strong, model-agnostic improvements in inference efficiency while preserving reasoning quality, with detailed experiments on GSM8k, Math500, AIME, and MMLU that validate generalizability and scalability. A public code release accompanies the work, highlighting its practical relevance for scalable and transparent LLM deployments.

Abstract

Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with minor degradation in final answer accuracy across DeepSeek-R1-7B, DeepSeek-R1-14B, and Qwen3-8B. This finding contrasts sharply with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly improves LLM inference efficiency while preserving accuracy, paving the way for more scalable LLM deployments and a better understanding of their internal reasoning. The code and data are released at https://github.com/staymylove/COT_Compresstion_via_Step_entropy.
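
The pruning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released code: the function names, the choice to sum per-token Shannon entropies into a step score, and the one-for-one replacement of each pruned step with a `[SKIP]` marker are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

SKIP_TOKEN = "[SKIP]"  # marker named in the abstract; its exact usage here is assumed


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Assumption: a step's entropy is the sum of the entropies of its tokens."""
    return sum(token_entropy(dist) for dist in token_dists)


def compress_cot(steps: List[str], entropies: List[float],
                 prune_ratio: float = 0.8) -> List[str]:
    """Replace the lowest-entropy fraction of steps with SKIP_TOKEN."""
    if len(steps) != len(entropies):
        raise ValueError("steps and entropies must have the same length")
    n_prune = int(round(len(steps) * prune_ratio))
    # Rank step indices from lowest to highest entropy.
    order = sorted(range(len(steps)), key=lambda i: entropies[i])
    pruned = set(order[:n_prune])
    return [SKIP_TOKEN if i in pruned else step for i, step in enumerate(steps)]


if __name__ == "__main__":
    steps = ["s0", "s1", "s2", "s3", "s4"]
    entropies = [0.1, 2.0, 0.3, 0.2, 0.05]
    print(compress_cot(steps, entropies, prune_ratio=0.8))
```

In practice the per-token distributions would come from the model's own output probabilities over the generated reasoning trace, and how steps are segmented is a separate design choice that this sketch leaves to the caller.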

Paper Structure

This paper contains 40 sections, 2 theorems, 10 equations, 4 figures, 9 tables.

Key Result

Lemma 1

Given a reasoning process where a sequence of steps $C = (S_1, S_2, \dots, S_N)$ leads to a final answer $A$, the conditional mutual information $I(S_j;A|\bar{S}_j)$ between the step and the answer, conditioned on all other steps $\bar{S}_j = C \setminus \{S_j\}$, is bounded by the step entropy $H(S
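
The lemma statement is cut off in this excerpt, so its exact right-hand side is not reproduced here. As background only, the textbook chain of inequalities that yields an entropy bound on conditional mutual information for discrete variables is sketched below; whether the authors bound by the conditional or the unconditional step entropy is an assumption left open.

```latex
\begin{align}
I(S_j; A \mid \bar{S}_j)
  &= H(S_j \mid \bar{S}_j) - H(S_j \mid A, \bar{S}_j) \\
  &\le H(S_j \mid \bar{S}_j)
     && \text{since } H(S_j \mid A, \bar{S}_j) \ge 0 \\
  &\le H(S_j)
     && \text{since conditioning cannot increase entropy}
\end{align}
```

Read this way, a step with low entropy can contribute only a small amount of information about the answer $A$ beyond what the other steps already carry.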

Figures (4)

  • Figure 1: Comprehensive Performance of CoT Compression via Step Entropy. (a) Accuracy vs. Mask Ratio on 50 samples from DeepScaleR. This plot illustrates the impact of different pruning strategies (Random, High-Entropy Steps, Low-Entropy Steps) on final answer accuracy as the mask ratio of intermediate CoT steps increases. Note that pruning up to 80% of low-entropy steps maintains the accuracy of the complete CoT. (b) Accuracy vs. Token Usage Ratio on other benchmarks. This plot compares the accuracy and token usage ratio of the Full CoT against our Compressed CoT (pruning 80% of low-entropy steps) across Math500, AIME 2024, and AIME 2025 on DeepSeek-R1-7B.
  • Figure 2: Accuracy of our method (via step entropy) compared with directly masking tokens (via token entropy) across various thinking-token mask ratios, with the Full CoT serving as the baseline, for DeepSeek-R1-14B on the DeepScaleR dataset.
  • Figure 3: Ablation study on different replacement strategies for pruned low-entropy steps. The experiment is conducted on DeepSeek-R1-7B with the same DeepScaleR samples as in Figure 1 (left), comparing four methods for handling pruned steps at high pruning ratios (80%, 85%, 90%).
  • Figure :
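
The comparison described for Figure 1 (random, high-entropy, and low-entropy pruning at a given mask ratio) can be sketched as a small selection routine. This is an illustrative sketch under assumptions: the function names, the strategy labels, and the definition of token usage ratio as compressed length divided by full length are hypothetical and not taken from the paper's code.

```python
import random
from typing import List


def select_masked_steps(entropies: List[float], mask_ratio: float,
                        strategy: str, seed: int = 0) -> List[int]:
    """Return sorted indices of steps to mask under one pruning strategy.

    strategy is one of "low", "high", or "random" (labels assumed here).
    """
    n_mask = int(round(len(entropies) * mask_ratio))
    indices = list(range(len(entropies)))
    if strategy == "low":
        ranked = sorted(indices, key=lambda i: entropies[i])
    elif strategy == "high":
        ranked = sorted(indices, key=lambda i: entropies[i], reverse=True)
    elif strategy == "random":
        ranked = indices[:]
        random.Random(seed).shuffle(ranked)  # seeded for reproducibility
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return sorted(ranked[:n_mask])


def token_usage_ratio(compressed_tokens: int, full_tokens: int) -> float:
    """Assumed definition: compressed CoT length over full CoT length."""
    if full_tokens <= 0:
        raise ValueError("full_tokens must be positive")
    return compressed_tokens / full_tokens


if __name__ == "__main__":
    entropies = [0.9, 0.1, 0.5, 0.3, 0.7]
    for name in ("low", "high", "random"):
        print(name, select_masked_steps(entropies, 0.4, name))
```

Running the same mask ratio through all three strategies and scoring the resulting answers is what produces an accuracy-vs-mask-ratio curve of the kind the caption describes.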

Theorems & Definitions (5)

  • Lemma 1: Entropy-Bounded Information
  • Theorem 1: Entropy-Bounded Information on Subset
  • Definition 1: Length-normalized Step Entropy
  • proof
  • proof