Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

TL;DR

This paper examines whether KV cache compression is viable for production LLM serving. The authors survey existing quantization- and sparsity-based approaches, benchmark them on production-relevant serving frameworks, and introduce three tools (Throughput Predictor, Length Predictor, Negative Sample Evaluator) to guide deployment. They find that throughput gains can be limited, or negated entirely when compression leads to longer outputs, and that negative samples and task sensitivity prevent universal applicability, which underscores the need for task-aware, production-centric evaluation. The work provides benchmarks, datasets, and open-source tools to better align research with real-world LLM serving needs.

Abstract

Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the computation cost. Despite the development of many compression algorithms, their applications in production environments are still not prevalent. In this paper, we revisit mainstream KV cache compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for KV cache compression and identify missing pieces in their performance measurement, which could hinder their adoption in practice. Second, we empirically evaluate representative KV cache compression methods to uncover two key issues that affect the computational efficiency: (1) while compressing KV cache can reduce memory consumption, current implementations (e.g., FlashAttention, PagedAttention) do not optimize for production-level LLM serving, resulting in suboptimal throughput performance; (2) compressing KV cache may lead to longer outputs, resulting in increased end-to-end latency. We further investigate the accuracy performance of individual samples rather than the overall performance, revealing the intrinsic limitations in KV cache compression when handling specific LLM tasks. Third, we provide tools to shed light on future KV cache compression studies and facilitate their practical deployment in production. They are open-sourced in https://github.com/LLMkvsys/rethink-kv-compression.
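
As a rough illustration of the two effects described in the abstract (smaller KV cache memory, but possibly longer outputs and higher end-to-end latency), the following back-of-the-envelope sketch may help. It is not taken from the paper or its repository. The function names are hypothetical, the model shape is the commonly cited LLaMA-7B configuration (32 layers, 32 heads, head dimension 128), the 4-bit figure ignores quantization metadata such as scales and zero points, and the prefill time, output lengths, and decode rates are made-up values chosen only to show the trade-off.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2.0):
    """Approximate bytes needed to hold keys and values for one batch."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def end_to_end_latency(prefill_s, output_tokens, decode_tokens_per_s):
    """Simplified latency model: prefill time plus time to decode the output."""
    return prefill_s + output_tokens / decode_tokens_per_s


# Memory: LLaMA-7B-like shape with a 4096-token context (assumed values).
fp16 = kv_cache_bytes(32, 32, 128, 4096)                      # 2 bytes/element
int4 = kv_cache_bytes(32, 32, 128, 4096, bytes_per_elem=0.5)  # 4 bits/element
print(f"FP16 KV cache: {fp16 / 2**30:.2f} GiB, 4-bit: {int4 / 2**30:.2f} GiB")

# Latency: hypothetical numbers. Compression decodes 20% faster (48 vs. 40
# tokens/s), but the response is 30% longer (260 vs. 200 tokens).
baseline = end_to_end_latency(prefill_s=0.5, output_tokens=200,
                              decode_tokens_per_s=40.0)
compressed = end_to_end_latency(prefill_s=0.5, output_tokens=260,
                                decode_tokens_per_s=48.0)
print(f"baseline: {baseline:.2f} s, compressed: {compressed:.2f} s")
```

Under these assumed numbers the cache shrinks from 2 GiB to 0.5 GiB, yet the compressed request finishes later than the baseline because the extra output tokens outweigh the faster decoding. This is the kind of interaction that a per-token throughput measurement alone would not reveal.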

Paper Structure

This paper contains 35 sections, 3 equations, 18 figures, 11 tables, 1 algorithm.

Figures (18)

  • Figure 1: Throughput analysis of LLaMA-7B: (a-b) The FP16 decoding throughput on TRL (with and without FlashAttention) and LMDeploy (LMD). (c-d) The speedup of the StreamingLLM algorithm on TRL and LMD. (e-h) The prefill throughput for various sizes of inputs. (i-l) The decoding throughput for various sizes of inputs.
  • Figure 2: Throughput analysis of LLaMA-70B on H800 GPUs.
  • Figure 3: The execution time of the attention layer of various compression algorithms measured across different prompt lengths.
  • Figure 4: The log-scaled distribution of response length difference over different compression algorithms and configurations.
  • Figure 5: The cumulative distribution function of the end-to-end latency (seconds) of various compression algorithms.
  • ...and 13 more figures