KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, Haibo Chen

TL;DR

This work addresses the gap in understanding real-world KV$ caching for LLM serving by analyzing production traces from a leading cloud provider. The traces show that KV$ reuse is common but highly skewed, and that temporal and spatial locality vary across workload categories. The paper proposes a workload-aware eviction policy that uses per-workload reuse distributions and block lifespans to prioritize KV$ blocks, integrates the policy into vLLM, and reports higher cache hit rates and lower tail latency on real traces. The findings show that single-turn and multi-turn requests both drive KV$ hits, that KV$ lifespans are short and predictable within a workload category, and that small GPU-based caches can suffice for many API-dominated workloads, while models that use multi-head attention (MHA) need larger caches. Overall, the paper offers a practical, data-driven approach to KV$ cache design that improves serving throughput and latency in large-scale LLM deployments.

Abstract

Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV$ caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of the KV$ workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV$ reuses are skewed across requests, where reuses between single-turn requests are as important as reuses between multi-turn requests; the reuse time and probability are diverse considering all requests, but for a specific request category, the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on the characterization, we further propose a workload-aware cache eviction policy that improves the serving performance under real-world traces, especially with limited cache capacity.
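
The idea of a workload-aware eviction policy can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not use any vLLM API. The class names, the per-category statistics (`reuse_prob`, `mean_reuse_gap`), and the exponential-decay scoring heuristic are all assumptions made for illustration; the only element taken from the summary above is that eviction priority depends on per-workload reuse behavior rather than on recency alone.

```python
import math
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Hypothetical per-category reuse statistics (illustrative, not from the paper)."""
    reuse_prob: float      # assumed fraction of blocks in this category that get reused
    mean_reuse_gap: float  # assumed mean seconds between an access and the next reuse


@dataclass
class Block:
    block_id: str
    category: str
    last_access: float


class WorkloadAwareCache:
    """Toy KV$ block cache that evicts the block least likely to be reused."""

    def __init__(self, capacity, profiles):
        self.capacity = capacity
        self.profiles = profiles
        self.blocks = {}

    def _score(self, block, now):
        # Heuristic score: the category's reuse probability, decayed by how long
        # the block has been idle relative to the category's typical reuse gap.
        # The exponential decay is an assumption of this sketch.
        profile = self.profiles[block.category]
        idle = now - block.last_access
        return profile.reuse_prob * math.exp(-idle / profile.mean_reuse_gap)

    def access(self, block_id, category, now):
        """Return True on a hit; on a miss, insert the block, evicting if full."""
        block = self.blocks.get(block_id)
        if block is not None:
            block.last_access = now
            return True
        if len(self.blocks) >= self.capacity:
            victim = min(self.blocks.values(), key=lambda b: self._score(b, now))
            del self.blocks[victim.block_id]
        self.blocks[block_id] = Block(block_id, category, now)
        return False


# Made-up profiles: chat blocks are reused often, API blocks rarely and quickly.
profiles = {
    "chat": WorkloadProfile(reuse_prob=0.8, mean_reuse_gap=60.0),
    "api": WorkloadProfile(reuse_prob=0.1, mean_reuse_gap=5.0),
}
cache = WorkloadAwareCache(capacity=2, profiles=profiles)
cache.access("a", "chat", now=0.0)
cache.access("b", "api", now=1.0)
cache.access("c", "chat", now=2.0)           # cache is full: evicts "b", not the older "a"
hit_on_a = cache.access("a", "chat", now=3.0)
```

In this toy run a plain LRU policy would evict block `a` because it is the oldest, whereas the score-based policy evicts the API block `b`, whose category is assumed to be reused rarely. That difference is the kind of behavior a workload-aware policy is meant to capture when cache capacity is limited.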

Paper Structure

This paper contains 16 sections, 2 equations, 29 figures, 2 tables.

Figures (29)

  • Figure 1: An illustration showing how an LLM processes requests.
  • Figure 2: An illustration of: ❶ how KV$ from a prefill request (Req#0) can be reused by the decoding of Req#0, and ❷ how KV$ can be reused for the prefill of a future request (Req#1).
  • Figure 3: An example of the collected trace record.
  • Figure 4: An analysis of the ideal cache hit ratio of the KV$ cache under real-world LLM serving workloads within a day. The reported accessed and hit block numbers are normalized (norm.).
  • Figure 5: Workload types and multi-turn ratio of requests in our collected traces.
  • ...and 24 more figures