Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Wei Li, Jie Hu, Xinghao Chen, Rongrong Ji, Shaohui Lin

TL;DR

This work tackles the high token cost of chain-of-thought reasoning in large language models by introducing Extra-CoT, a three-stage framework for extreme-ratio CoT compression. It combines a semantically-preserved, question-aware CoT compressor, mixed-ratio supervised fine-tuning to instill budget controllability, and CHRPO, a constrained hierarchical RL method that aggressively optimizes for accuracy under ultra-low budgets. The approach achieves substantial token reductions (e.g., over 73% on MATH-500) while improving or preserving accuracy across GSM8K, MATH-500, and AMC2023, and demonstrates strong robustness to long contexts and out-of-domain tasks. The results indicate that high-fidelity, efficient reasoning is feasible at extreme compression levels, enabling faster inference in resource-constrained settings with practical impact for mathematical reasoning tasks.

Abstract

Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via mixed-ratio supervised fine-tuning (SFT), which teaches it to follow a spectrum of compression budgets and provides a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO), which uses a hierarchical reward to explicitly incentivize question-solving ability under lower budgets. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods.
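
To relate the two quantities used on this page, the sketch below converts between the actual compression ratio (defined in the Figure 1 caption as compressed CoT token length divided by original length) and token reduction. It assumes token reduction is simply one minus that ratio; the function names and the token counts are hypothetical and chosen only for illustration, not taken from the paper.

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Actual compression ratio: compressed CoT length / original CoT length."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    if compressed_tokens < 0:
        raise ValueError("compressed_tokens must be non-negative")
    return compressed_tokens / original_tokens


def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of CoT tokens removed (assumed here to be 1 - compression ratio)."""
    return 1.0 - compression_ratio(original_tokens, compressed_tokens)


# Hypothetical counts: a 1000-token CoT compressed to 270 tokens.
ratio = compression_ratio(1000, 270)
reduction = token_reduction(1000, 270)
print(f"ratio={ratio:.2f}, reduction={reduction:.0%}")
```

Under this assumption, a token reduction of over 73% corresponds to an actual compression ratio below 0.27.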

Paper Structure (34 sections, 11 equations, 7 figures, 8 tables, 1 algorithm)

Figures (7)

  • Figure 1: Accuracy versus actual compression ratio of CoT tokens, defined as the ratio of the compressed CoT token length to the original length, across three math benchmarks evaluated on Qwen3-1.7B. Extra-CoT outperforms TokenSkip and Thinkless in the extremely low-ratio regime. The CHRPO policy further improves performance at the lowest inference budgets, validating the effectiveness of our RL optimization.
  • Figure 2: Overall pipeline of the proposed Extra-CoT, which comprises three training stages: (a) semantically-preserved, question-aware CoT compressor training, (b) mixed-ratio SFT, and (c) CHRPO. We first train a CoT compressor on mathematical CoT data with fine-grained annotations to generate in-domain fixed-ratio compressed data. During the mixed-ratio SFT stage, a reasoning LLM is fine-tuned on these fixed-ratio data combined with ratio-balanced warm-up data, which teaches it to follow a spectrum of compression budgets and provides a stable initialization for the final stage. The final stage employs CHRPO to refine the model: it uses an accuracy-driven strategy to set teacher budgets and explicitly rewards high accuracy in ultra-low compression regimes, thus incentivizing correct solutions.
  • Figure 3: An illustration of our proposed CHRPO's hierarchical reward mechanism, which features a main reward and a control-head reward. The main reward, targeting all tokens, integrates four criteria: accuracy, rationale integrity, budget calibration, and rationale-optimized mode. In contrast, the control-head reward is applied only to the first token, providing a direct and immediate signal to shape the policy's ratio selection.
  • Figure 4: Comparison of output quality between our compressor and LLMLingua-2 at compression ratios of 0.2 and 0.4. Our compressor produces coherent, semantically faithful output that preserves structural and formula integrity, whereas LLMLingua-2's output degrades into fragmented text with semantic discontinuities and incomplete formulas.
  • Figure 5: Compressor quality comparison between our method (Ours) and LLMLingua-2. Both compressors were used to compress the same dataset at four fixed compression ratios. LLMs then scored the outputs on a 1-5 scale across three metrics: Math Fidelity, Reasoning Coherence, and Clarity & Readability.
  • ...and 2 more figures
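
The Figure 3 caption describes a main reward that targets all tokens and a control-head reward applied only to the first token. A minimal sketch of that token-level assignment follows. The captions do not give the actual formulas, so everything numerical here is an assumption: the four main-reward criteria are combined as an unweighted sum, the control-head reward is added to the first token's reward, and all names (`MainRewardTerms`, `per_token_rewards`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MainRewardTerms:
    """The four criteria named in the Figure 3 caption (scores are hypothetical)."""
    accuracy: float
    rationale_integrity: float
    budget_calibration: float
    mode: float  # the caption's "rationale-optimized mode" criterion


def per_token_rewards(
    num_tokens: int,
    main: MainRewardTerms,
    control_head_reward: float,
) -> List[float]:
    """Assign the main reward to every token and the control-head reward to token 0."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # Assumption: the main reward is an unweighted sum of its four criteria.
    main_total = (
        main.accuracy
        + main.rationale_integrity
        + main.budget_calibration
        + main.mode
    )
    rewards = [main_total] * num_tokens
    # Assumption: the control-head signal is added on top of the main reward
    # at the first token, which the caption ties to ratio selection.
    rewards[0] += control_head_reward
    return rewards


terms = MainRewardTerms(accuracy=1.0, rationale_integrity=0.5,
                        budget_calibration=0.25, mode=0.25)
print(per_token_rewards(4, terms, control_head_reward=0.5))
```

The point of the sketch is only the hierarchy: one signal is shared by the whole sequence, while a separate signal reaches the first token alone, giving the ratio choice its own direct feedback.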