ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Lu Ye, Ze Tao, Yong Huang, Yang Li

TL;DR

The paper targets the self-attention bottleneck in long-context LLM inference by exploiting system prompts that are shared across requests. It combines two components: a prefix-aware KV cache (PAKV) organized as a prefix tree, and a two-phase partition (TPP) attention kernel. Together they detect common prefixes at runtime and share the corresponding KV tensors across requests, which reduces memory waste and improves data locality during decoding. Reported results show 3.2–4.8× kernel speedups and 70–90% KV-cache memory reductions when long prompts are shared, with end-to-end throughput gains of 1.6–2.3× on real models and deployments. The approach is aimed at multi-tenant LLM serving, where many requests begin with the same long system prompt.

Abstract

Self-attention is an essential component of large language models (LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory cost of self-attention can be reduced by exploiting the likelihood that multiple LLM requests share a system prompt as a common prefix. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of the KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into an auxiliary prefix tree. On top of the prefix-tree-based KV cache, we design an efficient self-attention kernel, in which a two-phase partition algorithm improves data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8$\times$ compared to the state-of-the-art implementation, with the length of the system prompt ranging from 1024 to 4096.
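
The chunked prefix tree described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Chunk`, `PrefixTreeKVCache`, `insert`, and the chunk size of 4 are assumptions made for illustration, and the key/value tensors themselves are omitted so that only the sharing logic is shown. The sketch also assumes that only completely filled chunks are shared, since a partially filled last chunk is private to its sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One tree node covering up to chunk_size consecutive tokens (hypothetical)."""
    tokens: tuple
    children: dict = field(default_factory=dict)  # full-chunk tokens -> Chunk
    ref_count: int = 0                            # sequences using this chunk
    # A real cache would also store the key/value tensors for these tokens.


class PrefixTreeKVCache:
    def __init__(self, chunk_size=4):  # assumed value; the paper's figures use c = 64
        self.chunk_size = chunk_size
        self.root = Chunk(tokens=())
        self.allocated_chunks = 0

    def insert(self, token_ids):
        """Register a prompt; return how many tokens reused existing chunks."""
        node, reused = self.root, 0
        for start in range(0, len(token_ids), self.chunk_size):
            key = tuple(token_ids[start:start + self.chunk_size])
            full = len(key) == self.chunk_size
            # Assumption: only full chunks are looked up and shared.
            child = node.children.get(key) if full else None
            if child is None:
                child = Chunk(tokens=key)
                self.allocated_chunks += 1
                if full:
                    node.children[key] = child
            else:
                reused += len(key)
            child.ref_count += 1
            node = child
        return reused


cache = PrefixTreeKVCache(chunk_size=4)
system_prompt = list(range(8))                 # two full chunks
r0 = cache.insert(system_prompt + [100, 101])  # first request allocates everything
r1 = cache.insert(system_prompt + [200, 201, 202])
r2 = cache.insert(system_prompt + [300])
print(r0, r1, r2, cache.allocated_chunks)
```

In this toy run the three requests need five chunks in total rather than the nine that three independent caches would allocate, because the two system-prompt chunks are stored once.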

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: KV cache in prefix tree. The instructions and examples in prompts of $S_0, S_1, S_2$ are common and sharable. Questions are different and not sharable. Some memory is unused due to alignment.
  • Figure 2: Two-phase partition kernel in ChunkAttention. The server is decoding sequences $S_0$, $S_1$, and $S_2$, which share chunks $C_0$, $C_1$, and $C_2$. In the chunk-first phase, queries $\bm{q}_0$, $\bm{q}_1$, and $\bm{q}_2$ are batched for self-attention with $C_0$, $C_1$, and $C_2$. The partial attention results $\bm{O}^{(C)}$, $\bm{m}^{(C)}$, and $\bm{n}^{(C)}$ are saved into memory. In the sequence-first phase, $\bm{o}_i$, $m_i$, and $n_i$ for each sequence are restored, and we continue processing the remaining chunks with respect to $\bm{q}_i$ only.
  • Figure 3: Throughput in token rate when generating up to $n_c$ completion tokens, given $n_s$ prefix tokens are shared. Chunk size $c$=64, batch size $b$=32.
  • Figure 4: Token rate when decoding up to $n_c$=64 completion tokens given various batch sizes. Chunk size $c$=64.
  • Figure 5: Normalized latency at different request arrival rates (RPS). Each line is labeled with the serving system and the shared prompt token count, in the form system($n_s$).
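
The two phases described in the Figure 2 caption can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's GPU kernel: the function names (`partial_attn`, `merge`, `two_phase_attention`, `reference_attention`) are hypothetical, keys and values are assumed to share the head dimension $d$, and partial results are combined with the standard online-softmax rescaling, using a running maximum $m$ and normalizer $n$ as in the caption.

```python
import numpy as np


def partial_attn(Q, K, V):
    """Unnormalized attention of queries Q (b, d) over one chunk K, V (c, d).

    Returns (O, m, n): weighted value sums, per-row score maxima, and
    per-row softmax normalizers, all computed for this chunk alone.
    """
    W = Q @ K.T / np.sqrt(Q.shape[-1])
    m = W.max(axis=-1)
    E = np.exp(W - m[:, None])
    return E @ V, m, E.sum(axis=-1)


def merge(acc, part):
    """Combine two partial results by rescaling both to a common maximum."""
    (O1, m1, n1), (O2, m2, n2) = acc, part
    m = np.maximum(m1, m2)
    x, y = np.exp(m1 - m), np.exp(m2 - m)
    return x[:, None] * O1 + y[:, None] * O2, m, x * n1 + y * n2


def two_phase_attention(Q, shared_chunks, private_chunks):
    """Q holds one decoding query per sequence, shape (b, d).

    shared_chunks: list of (K, V) pairs used by every sequence.
    private_chunks[i]: list of (K, V) pairs owned by sequence i only.
    """
    b, d = Q.shape
    # Chunk-first phase: each shared chunk is read once for the whole batch.
    acc = (np.zeros((b, d)), np.full(b, -np.inf), np.zeros(b))
    for K, V in shared_chunks:
        acc = merge(acc, partial_attn(Q, K, V))
    O, m, n = acc
    # Sequence-first phase: restore each sequence's partial result and
    # continue over the chunks that belong to that sequence alone.
    out = np.empty((b, d))
    for i in range(b):
        acc_i = (O[i:i + 1], m[i:i + 1], n[i:i + 1])
        for K, V in private_chunks[i]:
            acc_i = merge(acc_i, partial_attn(Q[i:i + 1], K, V))
        out[i] = acc_i[0][0] / acc_i[2][0]  # final softmax normalization
    return out


def reference_attention(q, K, V):
    """Plain single-query softmax attention, used only as a correctness check."""
    w = K @ q / np.sqrt(q.shape[0])
    p = np.exp(w - w.max())
    return (p / p.sum()) @ V


rng = np.random.default_rng(0)
b, d, c = 3, 8, 4
Q = rng.standard_normal((b, d))
shared = [(rng.standard_normal((c, d)), rng.standard_normal((c, d)))
          for _ in range(3)]
private = [[(rng.standard_normal((i + 1, d)), rng.standard_normal((i + 1, d)))]
           for i in range(b)]
out = two_phase_attention(Q, shared, private)
```

The benefit the sketch is meant to convey is that the shared chunks are loaded once per batch in the first loop rather than once per sequence, while the result for each sequence matches ordinary attention over its full key/value history.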