Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

Mengqi Liao; Lu Wang; Chaoyun Zhang; Bo Qiao; Si Qin; Qingwei Lin; Saravan Rajmohan; Dongmei Zhang; Huaiyu Wan

TL;DR

The paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention, together with a comprehensive scheduling strategy and support for prefix caching and asynchronous compression.

Abstract

With reasoning becoming the generative paradigm for large language models (LLMs), the memory bottleneck caused by KV cache during the decoding phase has become a critical factor limiting high-concurrency service. Although existing KV cache eviction methods address the memory issue, most of them are impractical for industrial-grade applications. This paper introduces Compressed PagedAttention, a method that combines token-wise KV cache eviction with PagedAttention. We propose a comprehensive scheduling strategy and support prefix caching and asynchronous compression for Compressed PagedAttention. Based on this, we have developed a high-concurrency LLM inference engine, Zipage. On large-scale mathematical reasoning tasks, Zipage achieves around 95\% of the performance of Full KV inference engines while delivering over 2.1$\times$ speedup.
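To make the block-level effect of compression concrete, the following is a minimal sketch, not the Zipage implementation. It uses the values $N_{\max}=4$ and $b=4$ from the Figure 1 caption and assumes that each cached token carries an importance score. The `Request` class, the `compress` function, and the scoring scheme are hypothetical names introduced only for illustration. The parameter $w$ from the caption is not modeled because this excerpt does not define it, and the actual movement of key/value tensors is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    # Physical block ids owned by the request, in logical order (hypothetical layout).
    block_ids: List[int]
    # One importance score per cached token (assumed; the scoring rule is not given here).
    scores: List[float]


def compress(req: Request, free_blocks: List[int],
             n_max: int = 4, block_size: int = 4) -> List[int]:
    """Keep the highest-scoring tokens so they fit in n_max - 1 blocks.

    Returns the original indices of the kept tokens, in their original order.
    """
    budget = (n_max - 1) * block_size
    n_tokens = len(req.scores)
    if n_tokens <= budget:
        return list(range(n_tokens))  # nothing to evict

    # Pick the top-`budget` tokens by score, then restore positional order.
    ranked = sorted(range(n_tokens), key=lambda i: req.scores[i], reverse=True)
    keep = sorted(ranked[:budget])

    # Kept entries are compacted into the first n_max - 1 blocks; one more
    # block stays with the request for subsequent decoding; the rest are released.
    free_blocks.extend(req.block_ids[n_max:])
    req.block_ids = req.block_ids[:n_max]
    req.scores = [req.scores[i] for i in keep]
    return keep


req = Request(block_ids=[10, 11, 12, 13, 14, 15],
              scores=[float(i) for i in range(24)])
free: List[int] = []
kept = compress(req, free)
print(len(kept), req.block_ids, free)
```

In this toy run a request holding six blocks (24 tokens) ends up with four blocks: three filled with the 12 kept entries and one empty block reserved for decoding, while the two remaining blocks return to the free pool.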

Paper Structure (31 sections, 4 equations, 30 figures, 5 tables, 4 algorithms)

Introduction
Related Work
Background
Method
Compressed PagedAttention
The Compression Process Pipeline
Hybrid Scheduling
Shared Prefix Cache for Compressed PagedAttention
Asynchronous Decoding and Compression
Experiments
Experimental Setup
Efficiency Analysis
Comparison with Other Frameworks
How to Set KV Cache Budgets?
Discussion
...and 16 more sections

Figures (30)

Figure 1: Illustration of Compressed PagedAttention. Here, $N_{\max}=4, b=4, w=2$. The figure depicts two requests requiring compression. After compression, the kept KV cache entries are moved to the first three blocks, while the fourth block is reserved for subsequent decoding. The remaining blocks are released.
Figure 2: State transition diagram of requests under hybrid scheduling.
Figure 3: Illustration of block allocation and release strategies for prefix cache.
Figure 4: Average time per step and ratio during inference with Qwen3 0.6B and Qwen3 8B on AMC 23 under non-asynchronous compression settings.
Figure 5: Comparison of average TPOT (ms) of all requests across different configurations on three workloads.
...and 25 more figures
