MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing

Zhaoyuan Su, Zeyu Zhang, Tingfeng Lan, Zirui Wang, Haiying Shen, Juncheng Yang, Yue Cheng

TL;DR

MorphServe targets elastic LLM serving under dynamic workloads through runtime morphological adaptation, built on two asynchronous mechanisms: LayerSwapper, which swaps in quantized layers at token granularity, and KVResizer, which elastically resizes the KV cache. Both are guided by a Layer Importance Score (LIS). Transitions preserve model state, so decoding progress is maintained and overhead stays low. Compared with full-precision serving, MorphServe reduces SLO violations by 92.45% on average and improves P95 TTFT by $2.2\times$–$3.9\times$ while preserving generation quality. It shows practical benefits across Vicuna, Llama, CodeLlama, and related models, outperforms static quantization in memory utilization (by up to 29.29%), and yields layer importance profiles that generalize. MorphServe offers a scalable, online solution that navigates the accuracy–latency Pareto frontier under bursty workloads, enabling elastic deployment of LLMs in real-world settings.

Abstract

Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimum runtime overhead and are fully compatible with modern scheduling and attention techniques. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45 percent and improves the P95 TTFT latency by 2.2x-3.9x compared to full-precision serving, without compromising generation quality. These results establish MorphServe as a practical and elastic solution for LLM deployment in dynamic environments.
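
The interplay between the two mechanisms can be illustrated with a toy control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Layer` and `Worker` classes, the watermark thresholds, the byte sizes, and the rule "quantize the least important full-precision layer first" are all hypothetical stand-ins for the paper's LayerSwapper, KVResizer, and Layer Importance Score. It only models memory accounting: quantizing a layer frees bytes that become extra KV cache blocks, and restoring the layer hands those blocks back once pressure subsides.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    index: int
    importance: float   # hypothetical importance score; lower = safer to quantize
    fp16_bytes: int     # assumed footprint of the full-precision layer
    int4_bytes: int     # assumed footprint of the quantized layer
    quantized: bool = False


@dataclass
class Worker:
    layers: list
    kv_blocks_total: int
    kv_blocks_used: int
    block_bytes: int
    high_watermark: float = 0.9   # assumed pressure threshold
    low_watermark: float = 0.5    # assumed recovery threshold

    def utilization(self) -> float:
        return self.kv_blocks_used / self.kv_blocks_total

    def _blocks_for(self, layer: Layer) -> int:
        # KV blocks that fit in the memory freed by quantizing this layer.
        return (layer.fp16_bytes - layer.int4_bytes) // self.block_bytes

    def step(self) -> str:
        """Take one adaptation decision and return a label for it."""
        if self.utilization() >= self.high_watermark:
            candidates = [l for l in self.layers if not l.quantized]
            if not candidates:
                return "saturated"
            victim = min(candidates, key=lambda l: l.importance)
            victim.quantized = True
            self.kv_blocks_total += self._blocks_for(victim)  # grow KV cache
            return f"quantize:{victim.index}"
        if self.utilization() <= self.low_watermark:
            candidates = [l for l in self.layers if l.quantized]
            if not candidates:
                return "idle"
            layer = max(candidates, key=lambda l: l.importance)
            give_back = self._blocks_for(layer)
            # Only restore if in-use blocks still fit after shrinking.
            if self.kv_blocks_total - give_back < self.kv_blocks_used:
                return "steady"
            layer.quantized = False
            self.kv_blocks_total -= give_back                 # shrink KV cache
            return f"restore:{layer.index}"
        return "steady"
```

Calling `step()` once per scheduling tick mimics a controller that reacts to memory pressure; a real system would additionally have to overlap weight transfers with ongoing decoding, which this sketch does not attempt to model.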

Paper Structure

This paper contains 13 sections, 4 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Motivation for dynamic adaptation design in LLM serving. (a) Real-world LLM workloads are highly dynamic and bursty in request and token volume. (b) Full-precision serving suffers TTFT spikes and SLO violations when the workload exceeds the saturation point. (c) A statically quantized model causes constant accuracy degradation, even during low-load periods when a full-precision model could be served. (d) MorphServe dynamically adapts to resource pressure and consistently achieves an optimal balance between SLO compliance and accuracy.
  • Figure 2: MorphServe dynamic adaptation workflow. Incoming requests (1) and real-time telemetry from workers (2) are aggregated by the Serving Monitor and sent to the Request Dispatcher (3). The Dispatcher routes requests to workers (4) and forwards runtime metrics to the Morphing Controller (5), which detects resource pressure and issues adaptation commands (5'). Responses (6) are returned to users, with only a small portion of tokens (in green) generated by mixed-precision layers.
  • Figure 3: Synergy of dynamic layer swapping and elastic KVC resizing. Panels (a)–(d) illustrate the model state morphing process: starting from full-precision serving (a), selected layers (b) are replaced with quantized versions (c) without disrupting the inference computation, which leads to mixed-precision layer serving (d). Panel (e) shows the detailed decoder layer swapping mechanism. Panel (f) demonstrates KVC block management under KVResizer, where newly vacant memory blocks are dynamically allocated to or deallocated from the KVC based on real-time workload shifts. KVResizer reduces the request preemption rate for decoding and the queueing time of incoming requests for prefilling.
  • Figure 4: MorphServe provides the best latency–accuracy tradeoff across four models and two traces, with four datasets. MorphServe in accuracy mode (dark green) reduces P95 TTFT by $2.2\times$–$3.9\times$ compared to full-precision serving while maintaining comparable generation quality. In performance mode (light green), MorphServe consistently outperforms INT4 quantized models in output quality with no additional latency overhead.
  • Figure 5: MorphServe dynamically adapts KVC block capacity to workload fluctuations. The red line indicates the KV cache capacity under full-precision serving. MorphServe (green) elastically attaches new KV blocks during peak loads, pushing the saturation boundary and avoiding the request preemption and KVC swapping seen in the full-precision baseline (blue). Static quantization (orange) underutilizes memory due to its fixed configuration, even when resource headroom is available.
  • ...and 2 more figures
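
The two headline metrics in the captions above, P95 TTFT and SLO violation rate, can be computed from per-request time-to-first-token samples. The following is a minimal sketch: the nearest-rank percentile definition, the function names, and the sample values are illustrative assumptions, since the summary does not state how the paper computes these metrics.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    # Integer arithmetic for ceil(0.95 * n) avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_violation_rate(ttft_ms, slo_ms):
    """Fraction of requests whose TTFT exceeds the SLO threshold."""
    return sum(t > slo_ms for t in ttft_ms) / len(ttft_ms)


# Hypothetical TTFT samples in milliseconds: 100, 200, ..., 2000.
ttft_ms = [100 * i for i in range(1, 21)]
tail_latency = p95(ttft_ms)
violations = slo_violation_rate(ttft_ms, slo_ms=1500)
```

A speedup such as the reported 2.2× to 3.9× would then be the ratio of the baseline's P95 TTFT to that of the adapted system on the same trace.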