Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

Mengqi Liao, Lu Wang, Chaoyun Zhang, Bo Qiao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan

TL;DR

This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention, and proposes a comprehensive scheduling strategy for it, together with support for prefix caching and asynchronous compression.

Abstract

With reasoning becoming the generative paradigm for large language models (LLMs), the memory bottleneck caused by the KV cache during the decoding phase has become a critical factor limiting high-concurrency service. Although existing KV cache eviction methods address the memory issue, most of them are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy for Compressed PagedAttention and support prefix caching and asynchronous compression. Based on this, we have developed a high-concurrency LLM inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage achieves around 95\% of the performance of Full KV inference engines while delivering a speedup of over 2.1$\times$.

Paper Structure

This paper contains 31 sections, 4 equations, 30 figures, 5 tables, 4 algorithms.

Figures (30)

  • Figure 1: Illustration of Compressed PagedAttention. Here, $N_{\max}=4, b=4, w=2$. The figure depicts two requests requiring compression. After compression, the kept KV cache entries are moved to the first three blocks, while the fourth block is reserved for subsequent decoding. The remaining blocks are released.
  • Figure 2: State transition diagram of requests under hybrid scheduling.
  • Figure 3: Illustration of block allocation and release strategies for prefix cache.
  • Figure 4: Average time per step and ratio during inference with Qwen3 0.6B and Qwen3 8B on AMC 23 under non-asynchronous compression settings.
  • Figure 5: Comparison of average TPOT (ms) of all requests across different configurations on three workloads.
  • ...and 25 more figures
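
The block bookkeeping described in the Figure 1 caption can be illustrated with a small sketch. This is not Zipage's implementation. It assumes that $N_{\max}$ is the number of blocks a request holds after compression and that $b$ is the block size in tokens; the caption does not define these symbols or $w$, so $w$ is not modeled here. The `BlockPool` class, the `compress_request` function, and the per-token `scores` are hypothetical stand-ins: the real eviction criterion is not given in this section.

```python
class BlockPool:
    """Hypothetical free-list allocator over fixed-size KV cache blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop(0)

    def release(self, block_id):
        self.free.append(block_id)


def compress_request(block_table, tokens, scores, pool, n_max=4, b=4):
    """Compact one request down to n_max blocks (an assumed reading of Figure 1).

    block_table: physical block ids held by the request, in logical order
    tokens:      per-token KV entries (plain values stand in for K/V tensors)
    scores:      placeholder importance score per token (assumption)
    Returns the new block table and the kept tokens in their original order.
    """
    budget = (n_max - 1) * b  # kept entries fill the first n_max - 1 blocks
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(ranked[:budget])  # restore original token order
    kept = [tokens[i] for i in keep_idx]

    # The first n_max - 1 blocks hold the kept entries; block n_max stays
    # allocated but empty, reserved for subsequent decoding.
    new_table = block_table[:n_max]

    # Every block beyond the first n_max goes back to the pool.
    for block_id in block_table[n_max:]:
        pool.release(block_id)
    return new_table, kept
```

With the caption's values ($N_{\max}=4$, $b=4$), a request holding six full blocks would keep 12 token entries in three blocks, retain one empty block for decoding, and return two blocks to the pool for other requests.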