KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

Yichun Xu, Navjot K. Khaira, Tejinder Singh

Abstract

The key-value (KV) cache is a foundational optimization in Transformer-based large language models (LLMs), eliminating redundant recomputation of past token representations during autoregressive generation. However, its memory footprint scales linearly with context length, imposing critical bottlenecks on GPU memory capacity, memory bandwidth, and inference throughput as production LLMs push context windows from thousands to millions of tokens. Efficient KV cache management has thus become a first-order challenge for scalable LLM deployment. This paper provides a systematic review of recent KV cache optimization techniques, organizing them into five principal directions: cache eviction, cache compression, hybrid memory solutions, novel attention mechanisms, and combination strategies. For each category we analyze the underlying mechanisms, deployment trade-offs, and empirical performance across memory reduction, throughput, and model accuracy metrics. We further map techniques to seven practical deployment scenarios, including long-context single requests, high-throughput datacenter serving, edge devices, multi-turn conversations, and accuracy-critical reasoning, providing actionable guidance for practitioners selecting among competing approaches. Our analysis reveals that no single technique dominates across all settings; instead, the optimal strategy depends on context length, hardware constraints, and workload characteristics, pointing toward adaptive, multi-stage optimization pipelines as a promising direction for future research.

Paper Structure

This paper contains 28 sections, 13 equations, 14 figures, and 5 tables.

Figures (14)

  • Figure 1: Autoregressive generation. At each step, the new token (orange) attends to all prior tokens (cyan). Without caching, the keys and values for every past token would be recomputed from scratch at each step; the KV cache avoids this by storing and reusing them.
  • Figure 2: Data-flow of the KV cache within a single transformer layer. Input token $x_t$ fans into three projections; $K_t$ and $V_t$ are appended to their respective caches (teal); $Q_t$ attends over the full caches to produce output $o_t$. Cache size grows as $O(T)$ per head per layer.
  • Figure 3: KV cache memory as a function of context length for three LLaMA-2 model variants under fp16 precision. Dashed lines mark GPU VRAM limits; dotted lines mark model parameter memory. At 128K tokens, a 7B model's KV cache ($\approx$64 GB) exceeds the capacity of an A100 GPU, illustrating the memory bottleneck that motivates KV cache optimization. Values are computed as $2 \times L \times H_{\text{kv}} \times d_h \times 2$ bytes per token, where the leading 2 counts keys and values, $L$ is the number of layers, $H_{\text{kv}}$ the number of KV heads, $d_h$ the per-head dimension, and the trailing 2 is the fp16 element size in bytes; the 70B model uses GQA with 8 KV heads.
  • Figure 4: Causal self-attention weight matrix for "The apple tastes sweet." visualised with the Viridis colormap (dark purple = low, yellow = high). Gray cells are causally masked future tokens. Each row sums to 1 (post-softmax). Query "sweet" concentrates 65% of its attention on "apple", demonstrating that KV entries carry highly non-uniform importance, the core premise of attention-score-driven eviction methods such as H$_2$O and SnapKV.
  • Figure 5: Taxonomy of KV cache optimization techniques surveyed in this paper, organized into five major categories.
  • ...and 9 more figures
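
As a rough check on the per-token formula in the Figure 3 caption, the sketch below evaluates it in Python. The choice of the 7B, 13B, and 70B variants, along with their layer counts, KV head counts, and head dimension, are assumptions taken from the publicly released LLaMA-2 configurations; the caption itself states only that the 70B model uses 8 KV heads. The function name `kv_cache_bytes` is hypothetical, and 128K is taken to mean 131,072 tokens.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """KV cache size in bytes for one sequence.

    Per token: 2 (keys and values) x layers x KV heads x head dim x bytes per element.
    bytes_per_elem=2 corresponds to fp16.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


# Assumed LLaMA-2 shapes (not given in the caption, except the 70B's 8 KV heads).
ASSUMED_CONFIGS = {
    "7B": {"num_layers": 32, "num_kv_heads": 32, "head_dim": 128},
    "13B": {"num_layers": 40, "num_kv_heads": 40, "head_dim": 128},
    "70B": {"num_layers": 80, "num_kv_heads": 8, "head_dim": 128},  # GQA
}

GIB = 2**30
CONTEXT_TOKENS = 128 * 1024  # assumption: "128K" means 131,072 tokens

if __name__ == "__main__":
    for name, cfg in ASSUMED_CONFIGS.items():
        size = kv_cache_bytes(num_tokens=CONTEXT_TOKENS, **cfg)
        print(f"LLaMA-2 {name}: {size / GIB:.1f} GiB of KV cache at 128K tokens")
```

Under these assumptions the 7B variant works out to 0.5 MiB per token and 64 GiB at 128K tokens, in line with the caption's $\approx$64 GB figure. The 70B variant comes to 40 GiB, less than the 7B model, because GQA reduces the number of KV heads from 32 to 8.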