On 10x Better Scalability: KV Stores Scale Up KV Cache

Weiping Yu, Ye Jiarui, He Mengke, Junfeng Liu, Siqiang Luo

TL;DR

This work tackles the scalability limits of disk-based KV caches in large-scale LLM serving, where file-per-object storage causes metadata, I/O, and locality bottlenecks. It proposes SGLang-LSM, a database-inspired KV cache system built on an LSM-tree that combines a prefix-preserving storage engine, an adaptive controller, and runtime services to maintain prefix semantics and support batch operations. The approach yields cache-hit improvements of up to 143% and TTFT reductions of up to 24% on dynamic workloads, across multiple models and prompt lengths. As the first systematic application of database storage techniques to large-scale KV cache management, SGLang-LSM offers a scalable, drop-in replacement for SGLang that can adapt to workload shifts and multi-tenant production deployments.

Abstract

Large language models (LLMs) rely on the key-value (KV) cache to reduce time-to-first-token (TTFT) latency, but existing disk-based KV cache systems that use file-per-object layouts suffer from severe scalability bottlenecks caused by file system metadata overhead, I/O inefficiency, and poor spatial locality. This paper presents SGLang-LSM, a database-inspired system that leverages Log-Structured Merge-tree (LSM-tree) architectures for scalable KV cache management. SGLang-LSM implements a layered system design with three coordinated components: (1) a prefix-preserving storage engine that maintains token sequence locality while efficiently storing large KV cache tensors through key-value separation, (2) an adaptive controller that dynamically optimizes LSM-tree configurations based on shifting workload characteristics, and (3) runtime services including batch operations and automatic resource management for production deployment. Evaluation on large-scale dynamic workloads demonstrates that SGLang-LSM improves cache hits by up to 143% and reduces TTFT by up to 24% compared to state-of-the-art systems, representing the first systematic application of database storage architectures to large-scale LLM cache management.
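
To make the first component more concrete, the following is a minimal sketch of prefix-preserving keys combined with key-value separation. It is not the SGLang-LSM implementation: the class `PrefixKVStore`, the helper `encode_key`, the block-aligned lookup, and the use of an in-memory sorted list as a stand-in for the LSM-tree index are all assumptions made for illustration. The idea shown is that token sequences are encoded so that byte order follows token order, which keeps entries sharing a prefix adjacent in the sorted index, while the large tensor payloads live in a separate append-only log and the index holds only small pointers.

```python
import bisect
import struct


def encode_key(tokens):
    """Encode token IDs as fixed-width big-endian bytes.

    Fixed width plus big-endian order means byte-wise comparison of keys
    matches element-wise comparison of token sequences, so sequences that
    share a prefix sort next to each other. (Hypothetical encoding.)
    """
    return b"".join(struct.pack(">I", t) for t in tokens)


class PrefixKVStore:
    """Toy store: a sorted key index plus a separate append-only value log."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.keys = []                 # sorted keys; stand-in for an LSM-tree index
        self.pointers = {}             # key -> (offset, length) into the value log
        self.value_log = bytearray()   # large payloads live here, not in the index

    def put(self, tokens, kv_bytes):
        key = encode_key(tokens)
        if key not in self.pointers:
            bisect.insort(self.keys, key)
        # Key-value separation: the index stores only a small pointer.
        self.pointers[key] = (len(self.value_log), len(kv_bytes))
        self.value_log += kv_bytes

    def get(self, tokens):
        ptr = self.pointers.get(encode_key(tokens))
        if ptr is None:
            return None
        offset, length = ptr
        return bytes(self.value_log[offset:offset + length])

    def longest_cached_prefix(self, tokens):
        """Return the length of the longest block-aligned cached prefix."""
        matched = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if encode_key(tokens[:end]) not in self.pointers:
                break
            matched = end
        return matched

    def scan_prefix(self, tokens):
        """Return all stored keys that extend the given token prefix."""
        prefix = encode_key(tokens)
        i = bisect.bisect_left(self.keys, prefix)
        found = []
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            found.append(self.keys[i])
            i += 1
        return found


store = PrefixKVStore(block_size=2)
store.put([1, 2], b"aa")
store.put([1, 2, 3, 4], b"bbbb")
store.put([9, 9], b"c")
hit_len = store.longest_cached_prefix([1, 2, 3, 4, 5, 6])
```

In this sketch a range scan over one key prefix touches a contiguous run of index entries, which is the locality property a file-per-object layout does not provide. A real LSM-tree would add memtables, on-disk sorted runs, and compaction, none of which are modeled here.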

Paper Structure

This paper contains 21 sections, 7 figures.

Figures (7)

  • Figure 1: SGLang-LSM.
  • Figure 2: An overview of an LSM-tree.
  • Figure 3: SGLang-LSM system.
  • Figure 4: Overall performance of SGLang-LSM with different prompt lengths.
  • Figure 5: (a)(b) Case study with different LLMs. (c) Case study of dynamic compaction.
  • ...and 2 more figures