R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search

Yibo Wang, Haotian Luo, Huanjin Yao, Tiansheng Huang, Haiying He, Rui Liu, Naiqiang Tan, Jiaxing Huang, Xiaochun Cao, Dacheng Tao, Li Shen

TL;DR

R1-Compress tackles the inefficiency of Long-CoT by introducing a chunk-level compression framework that preserves local reasoning signals and cross-chunk coherence. It combines inner-chunk compression with an inter-chunk search to assemble a short, coherent CoT, and then fine-tunes models on the compressed traces. Across MATH500, AIME24, and GPQA-Diamond, it achieves substantial token reductions with accuracy close to Long-CoT (e.g., 92.4% on MATH500 with only a 0.6% drop and ~20% token reduction on a 32B model), indicating strong efficiency gains without sacrificing reasoning quality. The approach demonstrates robust performance across model scales and out-of-distribution tasks, underscoring the practicality of chunk-level CoT compression for scalable reasoning systems.

Abstract

Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select a short, coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, only 0.6% below the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at https://github.com/w-yibo/R1-Compress
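
The abstract describes the pipeline only at a high level, so the following is a minimal Python sketch of the chunk-then-search idea under stated assumptions, not the authors' implementation. The function names `segment` and `search`, the word-count chunking rule, the `keep_ratio` parameter, and the `propose` and `score` callables are all hypothetical: `propose` stands in for the LLM that generates compressed candidates for a chunk, and `score` stands in for a model that rates a candidate given the already-selected prefix.

```python
from typing import Callable, List, Sequence


def segment(cot: str, max_words: int = 50) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Assumption: chunks are bounded by a word count; the paper's exact
    segmentation rule is not given in the abstract.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in (p.strip() for p in cot.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def search(
    chunks: Sequence[str],
    propose: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    keep_ratio: float = 0.5,
) -> str:
    """Build a compressed CoT chunk by chunk.

    For each chunk: (1) length filtering keeps the shortest candidates,
    (2) selection picks the candidate that `score` rates highest given
    the prefix chosen so far (a stand-in for a model probability).
    """
    prefix = ""
    for chunk in chunks:
        # Fall back to the uncompressed chunk if no candidate is proposed.
        candidates = sorted(propose(chunk) or [chunk], key=len)
        n_keep = max(1, int(len(candidates) * keep_ratio))
        shortlist = candidates[:n_keep]
        best = max(shortlist, key=lambda c: score(prefix, c))
        prefix = best if not prefix else prefix + "\n\n" + best
    return prefix
```

In practice `propose` would wrap calls to a compression LLM and `score` would compute something like the log-probability of a candidate conditioned on the prefix. Passing both in as callables keeps the search logic testable without a model.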

Paper Structure

This paper contains 32 sections, 6 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Pipeline of our method. The Long-CoT is segmented into chunks, multiple compressed candidates for each chunk are generated using an LLM, and then a compressed CoT is constructed chunk by chunk through inter-chunk search with length filtering and probability selection.
  • Figure 2: Comparison of Long-CoT, CoT-Valve, and C3oT. Red text indicates reflection-related phrases such as “Wait”.
  • Figure 3: Left: example of TokenSkip CoT compression. Right: token-level loss curves of Long-CoT and TokenSkip.
  • Figure 4: Token-level loss visualization.