Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

Yue Zhu, Hao Yu, Chen Wang, Zhuoran Liu, Eun Kyung Lee

TL;DR

Prefix prefill for the key-value cache (KVC) in LLM inference imposes significant metadata and memory pressure as input length grows, hurting time-to-first-token (TTFT) and throughput. The paper uses trace-driven analysis of real-world LLM-serving workloads and benchmarks Redis, CHIME, and Sherman to characterize KVC access patterns, revealing high block reuse, a mix of sequential and random accesses, and latency gaps in traditional KV stores. The findings expose a mismatch between current metadata management approaches and prefix-prefill requirements, motivating a metadata-centric redesign. The authors outline ongoing work on a hierarchical KVC caching system with a reuse-optimized metadata cache, workload-aware indexing, and hotness-aware data placement to enable low-latency, scalable long-context inference.

Abstract

The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads like Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundancy and improving speed. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores like Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of a tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.
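
To make the notion of block-level cache reusability concrete, the following is a minimal sketch of how prefix blocks might be keyed and how a reuse ratio could be measured over a sequence of requests. It is an illustration only, not the authors' method: the block size, the chained SHA-256 keying scheme, and the names `block_keys` and `reuse_ratio` are all assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical number of tokens per KVC block


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash key per full block of a prompt.

    Each key depends on the block's tokens and on the key of the
    preceding block, so identical tokens behind different prefixes
    map to different keys. (Assumed scheme, for illustration.)
    """
    keys = []
    parent = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        keys.append(parent.hex())
    return keys


def reuse_ratio(requests, block_size=BLOCK_SIZE):
    """Fraction of block lookups whose key was seen in an earlier lookup."""
    seen = set()
    hits = total = 0
    for tokens in requests:
        for key in block_keys(tokens, block_size):
            total += 1
            if key in seen:
                hits += 1
            else:
                seen.add(key)
    return hits / total if total else 0.0


# Toy trace: the second request shares only its first block with the first
# request, and the third request repeats the first one exactly.
toy_trace = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 9, 9, 9, 9],
    [1, 2, 3, 4, 5, 6, 7, 8],
]
ratio = reuse_ratio(toy_trace)
```

In this toy trace, three of the six block lookups hit a previously seen key. Each lookup corresponds to one metadata access, which is the kind of load a KVC metadata store would have to serve at low latency.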

Paper Structure

This paper contains 4 sections, 3 figures.

Figures (3)

  • Figure 1: Block Reusability over 1-Hour Trace
  • Figure 2: Sequential & Random Access Pattern in Requests
  • Figure 3: Normalized P99 Latency Based on Real Trace (Redis=1).