KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
Rongxin Cheng, Yuxin Lai, Xingda Wei, Rong Chen, Haibo Chen
TL;DR
KunServe addresses latency spikes in LLM serving caused by KVCache pressure on GPU memory by shifting to parameter-centric memory management. It derives online drop plans that remove replicated model parameters to free memory, coordinates KVCache exchange, and uses a lookahead batch formulation to enable efficient pipelined execution after drops. Key contributions include an $O(N\log N)$ greedy drop algorithm, a GPU-virtual-memory-based KVCache extension, a coordinated KVCache exchange mechanism, and a cost-model-guided scheduler that substantially reduces tail latency (up to $72.2\times$ in TTFT) while improving SLO adherence on realistic traces and models. The work demonstrates practical viability, and KunServe is released as open source for broader adoption.
Abstract
Serving LLMs with a cluster of GPUs is common nowadays, where the serving system must meet strict latency SLOs required by applications. However, the stateful nature of LLM serving requires maintaining huge states (i.e., KVCache) in limited GPU memory. Under spikes in real-world workloads, GPU memory can be easily throttled, leading to orders of magnitude higher response latency due to queuing introduced by waiting for KVCache to be reclaimed. Prior KVCache-centric approaches handle load throttling by dropping, migrating, or swapping KVCache. These methods fail to release sufficient memory quickly, so requests remain queued. This paper proposes the first parameter-centric approach to handling throttling by selectively dropping replicated parameters to instantly free memory for requests, based on a previously overlooked observation that model parameters are commonly replicated across GPUs for serving LLMs. With the additional memory, all requests can be served with a larger batch without queuing. To make the parameter-centric approach correct and efficient, we cooperatively execute requests on GPUs with a complete copy of parameters using pipeline parallelism, and derive an appropriate drop plan without unnecessary cooperation. We also design techniques to minimize the performance overhead due to pipeline parallelism with the execution patterns of requests under drop. Evaluations show that KunServe reduces the tail TTFT of requests under throttling by up to 72.2 times compared to the state-of-the-art systems including Llumnix, vLLM and InferCept.
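The idea of a drop plan can be illustrated with a toy sketch. It is not the algorithm from the paper: it assumes every serving instance starts with one full parameter copy of the same size, that merging two groups of instances into one pipeline-parallel group makes exactly one copy redundant, and that shallow pipelines are preferable, so the smallest groups are merged first. The function name `greedy_drop_plan` and its parameters are hypothetical.

```python
import heapq


def greedy_drop_plan(num_instances, param_gb, required_gb):
    """Toy drop plan (illustrative assumption, not KunServe's actual algorithm).

    Returns (groups, freed_gb), where each group is a list of instance ids
    that jointly keep a single full copy of the parameters.
    """
    # Heap entries are (group_size, tie_breaker, member_ids); initially every
    # instance is its own group holding a full parameter copy.
    heap = [(1, i, [i]) for i in range(num_instances)]
    heapq.heapify(heap)
    freed_gb = 0.0
    next_id = num_instances  # unique tie-breaker for merged groups

    # Merge the two smallest groups until enough memory is freed or only one
    # group (a single remaining copy) is left.
    while freed_gb < required_gb and len(heap) > 1:
        size_a, _, members_a = heapq.heappop(heap)
        size_b, _, members_b = heapq.heappop(heap)
        # The merged group needs only one copy, so one copy can be dropped.
        freed_gb += param_gb
        heapq.heappush(heap, (size_a + size_b, next_id, members_a + members_b))
        next_id += 1

    groups = sorted(sorted(members) for _, _, members in heap)
    return groups, freed_gb


if __name__ == "__main__":
    # Four instances, 10 GB of parameters each, 15 GB of KVCache space needed.
    print(greedy_drop_plan(4, 10.0, 15.0))
```

Under these assumptions, merging only until the memory demand is met keeps pipeline groups as small as possible, which reflects the abstract's goal of deriving a drop plan "without unnecessary cooperation".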
