Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

Jiwon Song; Dongwon Jo; Yulhwa Kim; Jae-Joon Kim

TL;DR

Reasoning Path Compression (RPC) tackles the memory and throughput bottlenecks of long reasoning trajectories in LLMs through training-free, periodic KV-cache compression. It exploits the semantic sparsity of reasoning traces: a selector window of recently generated queries scores the cached tokens, and only the highest-scoring entries are retained, so recent context is preserved while redundant entries are discarded. Empirical results show RPC achieves up to 1.68× throughput and over 50% peak-memory reduction with minimal accuracy loss (as low as 1.2% on AIME 2024 for 32B models), outperforming training-based baselines and prior KV-cache methods. This makes reasoning LLMs more practical to deploy for long-form generation and complex problem solving.

Abstract

Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective for problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining the cache entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves the generation throughput of QwQ-32B by up to 1.60× compared to inference with a full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
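
The compression step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that); it is a single-head, pure-Python toy that assumes the importance score of a cached token is the average causal attention probability it receives from the queries in the selector window. The names compress_kv, recent_queries, and keep are hypothetical and chosen only for this illustration.

```python
import math


def _softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress_kv(keys, values, recent_queries, keep):
    """Toy selector-window KV compression (hypothetical helper, single head).

    keys, values   : cached entries, one per generated token (lists of vectors)
    recent_queries : queries of the last len(recent_queries) tokens; these
                     tokens form the selector window and are always retained
    keep           : number of older entries to retain (assumed budget)
    Returns (kept_keys, kept_values, kept_indices).
    """
    n = len(keys)
    window = len(recent_queries)
    n_old = n - window  # entries that are candidates for eviction
    if n_old <= keep:
        return list(keys), list(values), list(range(n))

    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * n_old
    for j, q in enumerate(recent_queries):
        visible = n_old + j + 1  # causal mask: a query sees itself and the past
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys[:visible]]
        probs = _softmax(logits)
        for i in range(n_old):
            scores[i] += probs[i] / window  # average over the selector window

    # Keep the highest-scoring older entries, restored to their original order,
    # followed by every entry inside the selector window.
    top = sorted(range(n_old), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top) + list(range(n_old, n))
    return [keys[i] for i in kept], [values[i] for i in kept], kept
```

Calling a function like this every fixed number of decoding steps would give the periodic schedule the abstract refers to; the interval, the window size, and the retention budget are tunable assumptions here, not values taken from the paper.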
