MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
Zhaoyuan Su, Zeyu Zhang, Tingfeng Lan, Zirui Wang, Haiying Shen, Juncheng Yang, Yue Cheng
TL;DR
MorphServe addresses elastic LLM serving under dynamic workloads through runtime morphological adaptation. It introduces two asynchronous mechanisms, both guided by a Layer Importance Score (LIS): LayerSwapper, which performs token-level quantized layer swapping, and KVResizer, which elastically resizes the KV cache. These mechanisms enable state-preserving transitions that minimize overhead and maintain decoding progress, reducing average SLO violations by $92.45\%$ and improving tail latency (P95 TTFT) by $2.2$–$3.9\times$ while preserving generation quality. MorphServe demonstrates practical benefits across Vicuna, Llama, CodeLlama, and related models, outperforming static quantization in memory utilization (up to $29.29\%$) and showing that layer importance profiles generalize. It offers a scalable, online solution that navigates the accuracy–latency Pareto frontier under bursty workloads, enabling elastic deployment of LLMs in real-world settings.
Abstract
Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimal runtime overhead and are fully compatible with modern scheduling and attention techniques. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45 percent and improves P95 time-to-first-token (TTFT) latency by 2.2x-3.9x compared to full-precision serving, without compromising generation quality. These results establish MorphServe as a practical and elastic solution for LLM deployment in dynamic environments.
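
To make the interplay between the two mechanisms concrete, the following is a minimal Python sketch of a pressure-driven swap policy. It is not MorphServe's implementation: the `Layer` fields, the `relieve_pressure` function, the 0.9 utilization threshold, and the greedy lowest-importance-first rule are all assumptions chosen for illustration. The only ideas taken from the abstract are that less impactful layers are replaced by quantized alternatives under load, and that KV cache capacity is adjusted in response to memory pressure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    index: int
    importance: float   # hypothetical importance score; higher = more impact on quality
    full_bytes: int     # memory footprint of the full-precision layer
    quant_bytes: int    # memory footprint of its quantized alternative
    quantized: bool = False


def relieve_pressure(layers: List[Layer], kv_used: int, kv_capacity: int,
                     threshold: float = 0.9) -> Tuple[List[int], int]:
    """Greedy sketch (an assumption, not the paper's algorithm).

    While KV cache utilization exceeds `threshold`, swap the least important
    full-precision layer for its quantized version and hand the freed bytes
    to the KV cache. Returns the swapped layer indices and the new capacity.
    """
    swapped: List[int] = []
    # Consider only layers still in full precision, least important first.
    candidates = sorted((l for l in layers if not l.quantized),
                        key=lambda l: l.importance)
    for layer in candidates:
        if kv_used <= threshold * kv_capacity:
            break  # pressure relieved; stop degrading precision
        layer.quantized = True
        kv_capacity += layer.full_bytes - layer.quant_bytes
        swapped.append(layer.index)
    return swapped, kv_capacity
```

Calling `relieve_pressure` with four layers of importance 0.9, 0.1, 0.5, and 0.3 (each 100 bytes in full precision and 25 bytes quantized), `kv_used=180`, and `kv_capacity=100` swaps the two least important layers and stops once utilization drops below the threshold. A real system would also need the reverse path, restoring full-precision layers when load subsides, which this sketch omits.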
