R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search
Yibo Wang, Haotian Luo, Huanjin Yao, Tiansheng Huang, Haiying He, Rui Liu, Naiqiang Tan, Jiaxing Huang, Xiaochun Cao, Dacheng Tao, Li Shen
TL;DR
R1-Compress addresses the inference cost of Long-CoT with a chunk-level compression framework that preserves both local reasoning signals and cross-chunk coherence. It compresses each chunk individually, uses an inter-chunk search to assemble a short, coherent CoT, and then fine-tunes models on the compressed traces. Across MATH500, AIME24, and GPQA-Diamond, it substantially reduces token usage while keeping accuracy close to Long-CoT (e.g., 92.4% on MATH500 on a 32B model, only a 0.6% drop, with about 20% fewer tokens), indicating efficiency gains with little loss in reasoning quality. The approach performs robustly across model scales and on out-of-distribution tasks, which supports the practicality of chunk-level CoT compression for scalable reasoning systems.
Abstract
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches, whether instance-level or token-level, either sacrifice essential local reasoning signals such as reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments a Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select a short, coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at https://github.com/w-yibo/R1-Compress
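
The three steps named in the abstract (segmenting, inner-chunk compression, inter-chunk search) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the blank-line chunking rule, the greedy search, and the `min_coherence` threshold are all assumptions made for illustration, and `toy_propose` and `toy_coherence` are simple stand-ins for what would in practice be an LLM compressor and a model-based coherence score.

```python
from typing import Callable, List


def split_into_chunks(cot: str, max_words: int = 40) -> List[str]:
    """Group blank-line-separated steps into chunks of roughly max_words words.

    Assumption: reasoning steps are separated by blank lines.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for step in (s.strip() for s in cot.split("\n\n")):
        if not step:
            continue
        n = len(step.split())
        # Start a new chunk when adding this step would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(step)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def compress_and_search(
    chunks: List[str],
    propose: Callable[[str], List[str]],
    coherence: Callable[[str, str], float],
    min_coherence: float = 0.5,
) -> List[str]:
    """Greedy search: per chunk, keep the shortest sufficiently coherent candidate."""
    selected: List[str] = []
    for chunk in chunks:
        # The uncompressed chunk is always available as a candidate.
        candidates = list(propose(chunk)) + [chunk]
        prefix = " ".join(selected)
        acceptable = [c for c in candidates if coherence(prefix, c) >= min_coherence]
        pool = acceptable or [chunk]  # fall back to the original chunk
        selected.append(min(pool, key=lambda c: len(c.split())))
    return selected


def toy_propose(chunk: str) -> List[str]:
    """Hypothetical stand-in for LLM compression: drop fillers, then also truncate."""
    fillers = {"so", "well", "hmm", "basically", "actually"}
    words = [w for w in chunk.split() if w.lower().strip(",.") not in fillers]
    return [" ".join(words), " ".join(words[: max(1, len(words) // 2)])]


def toy_coherence(prefix: str, candidate: str) -> float:
    """Hypothetical stand-in score: reject candidates cut off mid-sentence."""
    return 1.0 if candidate.rstrip().endswith((".", "?", "!")) else 0.0
```

Calling `compress_and_search(split_into_chunks(cot), toy_propose, toy_coherence)` returns one compressed string per chunk. Under these assumptions, the truncated candidates are filtered out by the coherence check, so a shorter candidate is accepted only when it still reads as a complete step. The compressed traces would then serve as fine-tuning data, a stage this sketch does not cover.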
