KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving

Rongxin Cheng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen

TL;DR

KunServe addresses latency spikes in LLM serving caused by GPU memory pressure on KVCache by shifting to parameter-centric memory management. It derives online drop plans that remove replicated model parameters to free memory, coordinates KVCache exchange, and uses a lookahead batch formulation to enable efficient pipelined execution after drops. Key contributions include an $O(N\log N)$ greedy drop algorithm, a GPU-virtual-memory-based KVCache extension, a coordinated KVCache exchange mechanism, and a cost-model-guided scheduler that substantially reduces tail latency (up to $72.2\times$ in TTFT) while improving SLO adherence on realistic traces and models. The work demonstrates practical viability and provides open-source release of KunServe for broader adoption.

Abstract

Serving LLMs with a cluster of GPUs is common nowadays, where the serving system must meet strict latency SLOs required by applications. However, the stateful nature of LLM serving requires maintaining huge states (i.e., KVCache) in limited GPU memory. Under spikes in real-world workloads, GPU memory can be easily throttled, leading to orders of magnitude higher response latency due to queuing introduced by waiting for KVCache to be reclaimed. Prior KVCache-centric approaches handle load throttling by dropping, migrating, or swapping KVCache. These methods fail to release sufficient memory quickly, so requests remain queued. This paper proposes the first parameter-centric approach to handling throttling by selectively dropping replicated parameters to instantly free memory for requests, based on a previously unnoticed observation that model parameters are commonly replicated across GPUs for serving LLMs. With the additional memory, all requests can be served with a larger batch without queuing. To make the parameter-centric approach correct and efficient, we cooperatively execute requests on GPUs that together hold a complete copy of parameters using pipeline parallelism, and derive an appropriate drop plan without unnecessary cooperation. We also design techniques to minimize the performance overhead due to pipeline parallelism with the execution patterns of requests under drop. Evaluations show that KunServe reduces the tail TTFT of requests under throttling by up to 72.2 times compared to state-of-the-art systems including Llumnix, vLLM and InferCept.
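
To make the idea of a drop plan concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a simplified model in which every serving instance initially holds one full parameter copy of `param_bytes`, and merging two groups of instances into one pipeline-parallel group lets the merged group drop one redundant copy. The function names `greedy_drop_plan` and `freed_bytes`, and the heuristic of always merging the two smallest groups (to keep pipelines short), are hypothetical choices made for illustration. The heap gives the sketch an $O(N\log N)$ running time.

```python
import heapq


def greedy_drop_plan(num_instances, param_bytes, required_bytes):
    """Hypothetical greedy plan: return the sorted group sizes after merging,
    or None if even a single group of all instances cannot free enough memory.

    Assumption: each merge of two groups frees exactly one parameter copy.
    """
    if required_bytes <= 0:
        # No memory pressure: every instance keeps its full copy.
        return [1] * num_instances
    # Integer ceiling of required_bytes / param_bytes.
    merges_needed = -(-required_bytes // param_bytes)
    if merges_needed > num_instances - 1:
        return None
    heap = [1] * num_instances  # a list of equal values is already a valid heap
    for _ in range(merges_needed):
        # Merge the two smallest groups so pipeline depth grows slowly.
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, a + b)
    return sorted(heap)


def freed_bytes(groups, param_bytes):
    """Memory freed when each group keeps only one parameter copy in total."""
    return sum(size - 1 for size in groups) * param_bytes


if __name__ == "__main__":
    plan = greedy_drop_plan(num_instances=8, param_bytes=16, required_bytes=40)
    print(plan, freed_bytes(plan, 16))
```

Under these assumptions, a request for 40 units with 16-unit parameter copies needs three merges, so three of the eight instances' copies are dropped and the remaining instances are untouched. A real planner would also have to account for the performance cost of deeper pipelines, which this sketch ignores.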

Paper Structure

This paper contains 22 sections, 2 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: An illustration of a typical LLM serving scenario: (a) the model is deployed on different servers with model parallelism, and prefill and decode requests are processed in batches. “exe.” is an abbreviation for “execution”.
  • Figure 2: Analysis of TTFT increases due to GPU memory overloading (abbreviated as “Over.” in the figure). (a) The incoming request rate of the BurstGPT trace. (b) KVCache memory demand on vLLM. (c)-(e) Request TTFT of existing solutions (see the paper's section on state-of-the-art approaches).
  • Figure 3: (a)-(c) Existing methodologies to address memory overloading of KVCache. (d) How KunServe tackles this issue via parameter dropping (❶) and remapping memory to enlarge the KVCache region (❷).
  • Figure 4: System overview of KunServe.
  • Figure 5: A comparison of the latency of different parallelism strategies on the BurstGPT dataset. All setups are evaluated with 8 GPUs.
  • ...and 12 more figures