LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management

Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan

TL;DR

LayerKV tackles the TTFT surge in long-context LLM serving by recognizing queuing delays from GPU KV cache block contention as the primary bottleneck. It introduces a lightweight plug-in with layer-wise KV block allocation, management, and offloading, paired with an SLO-aware scheduler to optimize TTFT without sacrificing TPOT or QPS. Across 7B–70B models and diverse GPU configurations, LayerKV achieves up to 69× reductions in mean TTFT and lowers SLO violation rates by up to 28.7%, improving user experience in real-time interactions. The method is designed to be compatible with existing serving systems and parallelism strategies, offering a practical path to scalable, low-latency LLM inference without additional hardware.

Abstract

The expanding context windows of large language models (LLMs) have greatly enhanced their capabilities in various applications, but they also make it harder to maintain low latency, particularly Time to First Token (TTFT). This paper identifies that the sharp rise in TTFT as context length increases is driven predominantly by queuing delays, which arise when the growing demand for GPU Key-Value (KV) cache allocation collides with the limited supply of KV cache blocks. To address this issue, we propose LayerKV, a simple yet effective plug-in method that reduces TTFT without requiring additional hardware or compromising output performance, while integrating with existing parallelism strategies and scheduling techniques. Specifically, LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory, coupled with an SLO-aware scheduler to optimize overall Service Level Objectives (SLOs). Comprehensive evaluations on representative models, ranging from 7B to 70B parameters, across various GPU configurations, demonstrate that LayerKV improves TTFT by up to 69x and reduces SLO violation rates by 28.7%, significantly enhancing the user experience.
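To make the KV-block contention concrete, the following back-of-the-envelope sketch estimates how much GPU memory a single long prompt's KV cache occupies and how that shrinks if only some layers stay resident on the GPU. It is not taken from the paper: the LLaMA-2-7B shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) is the publicly documented configuration, while the block size of 16 tokens, the 16K-token prompt, and the choice of 8 resident layers are illustrative assumptions, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # 16-bit (fp16/bf16) KV entries


def kv_bytes_per_token(m: ModelShape, layers: Optional[int] = None) -> int:
    """Bytes of KV cache one token occupies across `layers` layers.

    The leading factor 2 accounts for storing both a key and a value.
    """
    n = m.num_layers if layers is None else layers
    return 2 * n * m.num_kv_heads * m.head_dim * m.dtype_bytes


def blocks_needed(prompt_tokens: int, block_size: int = 16) -> int:
    """Number of fixed-size KV blocks a prompt needs (ceiling division).

    block_size=16 is an assumed value, not one stated in the source.
    """
    return -(-prompt_tokens // block_size)


def prompt_kv_gib(m: ModelShape, prompt_tokens: int,
                  layers: Optional[int] = None) -> float:
    """KV cache footprint of a prompt in GiB for the given layer count."""
    return kv_bytes_per_token(m, layers) * prompt_tokens / 2**30


# Publicly documented LLaMA-2-7B shape.
LLAMA2_7B = ModelShape(num_layers=32, num_kv_heads=32, head_dim=128)

if __name__ == "__main__":
    prompt = 16_384  # hypothetical long prompt
    print("bytes per token:", kv_bytes_per_token(LLAMA2_7B))
    print("blocks per layer:", blocks_needed(prompt))
    print("all layers on GPU (GiB):", prompt_kv_gib(LLAMA2_7B, prompt))
    # Hypothetical: only 8 of 32 layers keep their KV blocks on the GPU.
    print("8 layers on GPU (GiB):", prompt_kv_gib(LLAMA2_7B, prompt, layers=8))
```

Under these assumptions, a request-wise allocator must find space for the whole prompt's KV cache before prefill can start, whereas allocating per layer lowers the GPU memory a new request has to claim up front. How LayerKV actually decides which layers to keep or offload is described in the paper itself and is not modeled here.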

Paper Structure

This paper contains 24 sections, 5 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: LLaMA-2-7B on a single L20 GPU with 48GB memory at a request arrival rate of 1 req/s. All latency measurements represent the average across 100 requests.
  • Figure 2: The surge in queuing delays is caused by the inability to process long prompts due to insufficient KV blocks.
  • Figure 3: LayerKV System Overview
  • Figure 4: Performance Comparison of LayerKV and vLLM Under Varying Context Lengths.
  • Figure 5: Performance Comparison of LayerKV and vLLM Under Varying Degree of Parallelism.
  • ...and 3 more figures