Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference
Yue Zhu, Hao Yu, Chen Wang, Zhuoran Liu, Eun Kyung Lee
TL;DR
Prefix prefilling with a Key-Value Cache (KVC) in LLM inference places increasing pressure on metadata and memory as input length grows, which hurts time-to-first-token (TTFT) and throughput. The paper uses trace-driven analysis of real-world LLM-serving workloads and benchmarks Redis, CHIME, and Sherman to characterize KVC access patterns. The analysis reveals high block reuse, a mix of sequential and random accesses, and latency gaps in traditional key-value stores. These findings expose a mismatch between current metadata management approaches and the requirements of prefix prefilling, motivating a metadata-centric redesign. The authors outline ongoing work on a hierarchical KVC caching system with a reuse-optimized metadata cache, workload-aware indexing, and hotness-aware data placement, aimed at low-latency, scalable long-context inference.
Abstract
The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads such as Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundant computation and improving inference speed. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores such as Redis, as well as state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]), for KVC metadata management. Our work demonstrates the lack of a storage solution tailored to KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.
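
To make the idea of KVC metadata lookup for prefix prefilling concrete, the following is a minimal Python sketch and not the authors' system. It assumes that a prompt is split into fixed-size token blocks, that each block's key is a hash of the block together with all blocks before it, and that an in-memory dict stands in for the metadata store (Redis, CHIME, or Sherman in the evaluation). The names `BLOCK_SIZE`, `block_keys`, and `MetadataIndex` are hypothetical and chosen only for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KVC block; an illustrative value, not taken from the paper


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one key per full block; each key depends on the block and its entire prefix."""
    keys = []
    parent = b""  # digest of the previous block key, empty for the first block
    n_full = len(token_ids) - len(token_ids) % block_size  # ignore a trailing partial block
    for start in range(0, n_full, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256()
        h.update(parent)
        h.update(",".join(map(str, block)).encode())
        parent = h.digest()
        keys.append(parent.hex())
    return keys


class MetadataIndex:
    """Toy metadata store mapping block keys to (hypothetical) KVC block locations."""

    def __init__(self):
        self._index = {}

    def insert(self, token_ids):
        # Register every full block of the prompt; already-known blocks are left untouched.
        for key in block_keys(token_ids):
            self._index.setdefault(key, f"block-{len(self._index)}")

    def longest_prefix_hit(self, token_ids):
        # Count leading blocks whose KVC entries are already registered.
        hits = 0
        for key in block_keys(token_ids):
            if key not in self._index:
                break
            hits += 1
        return hits


index = MetadataIndex()
index.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # two full blocks are registered
shared = index.longest_prefix_hit([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102])
print(shared)  # number of leading blocks that could reuse cached KVC entries
```

Under these assumptions, each new request issues one metadata lookup per block until the first miss, so lookups for a shared prefix proceed in block order while different requests touch unrelated keys. This is one plausible way in which a mix of sequential and random accesses to the metadata store could arise.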
