Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
Yuntian Tang, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Wenxi Li, Wei Li, Jie Hu, Xinghao Chen, Rongrong Ji, Shaohui Lin
TL;DR
This work tackles the high token cost of chain-of-thought (CoT) reasoning in large language models by introducing Extra-CoT, a three-stage framework for extreme-ratio CoT compression. It combines a semantically-preserved, question-aware CoT compressor; mixed-ratio supervised fine-tuning, which instills budget controllability; and CHRPO, a constrained hierarchical RL method that optimizes for accuracy under ultra-low token budgets. The approach achieves substantial token reductions (e.g., over 73% on MATH-500 with Qwen3-1.7B) while improving or preserving accuracy across GSM8K, MATH-500, and AMC2023, and it remains robust on long contexts and out-of-domain tasks. The results indicate that high-fidelity, efficient reasoning is feasible at extreme compression levels, which enables faster inference in resource-constrained settings, particularly for mathematical reasoning tasks.
Abstract
Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead at inference time. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via mixed-ratio supervised fine-tuning (SFT), which teaches it to follow a spectrum of compression budgets and provides a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO), which uses a hierarchical reward to explicitly incentivize question-solving ability under lower budgets. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods.
