DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving

Chaoyi Ruan; Yinhe Chen; Dongqi Tian; Yandong Shi; Yongji Wu; Jialin Li; Cheng Li

TL;DR

DynaServe tackles the problem of delivering low tail latency (P99 time-between-tokens, TBT) and high goodput for LLM serving under dynamic workload skew. It introduces Adaptive Request Partition and Scheduling (APS), a two-level framework built on two ideas: micro-requests that can span arbitrary token boundaries, and unified GPU instances. Together these support both coarse-grained colocation and fine-grained disaggregation in a single system. A global scheduler selects near-optimal split points based on decoded length and current load, while local schedulers perform SLO-aware batching to maximize GPU utilization without violating the 100 ms TBT SLO. Runtime chunk-based KV-cache transfers further reduce inter-instance overhead. Across real workloads on A100 clusters, DynaServe achieves up to $3.07\times$ the serving capacity and up to $1.91\times$ and $1.61\times$ better goodput than colocation and disaggregation baselines, respectively, and remains robust under hybrid and real-time traffic. The approach unifies and extends existing paradigms, offering a practical, scalable way to serve dynamic workloads with tight latency targets and high efficiency.
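To make the SLO-aware batching idea concrete, here is a minimal sketch of how a local scheduler might size the prefill portion of a batch so that one iteration stays within the 100 ms TBT SLO. The linear latency model, its coefficients, and the names `prefill_token_budget` and `form_batch` are illustrative assumptions, not DynaServe's actual cost model or API.

```python
# Hypothetical linear latency model, in integer microseconds (assumed values):
#   iteration_time = BASE_US + DECODE_US * num_decodes + PREFILL_US * prefill_tokens
SLO_US = 100_000     # 100 ms TBT SLO from the text
BASE_US = 10_000     # assumed fixed per-iteration overhead
DECODE_US = 200      # assumed cost of one decoding request in the batch
PREFILL_US = 50      # assumed cost of one prefill token in the batch


def prefill_token_budget(num_decodes):
    """Largest number of prefill tokens that keeps the iteration within the SLO."""
    headroom = SLO_US - BASE_US - DECODE_US * num_decodes
    return max(0, headroom // PREFILL_US)


def form_batch(num_decodes, prefill_queue):
    """Greedily pack prefill chunks (request_id, tokens) into the remaining budget.

    prefill_queue is a list of (request_id, remaining_prefill_tokens) in FIFO order.
    A request that does not fit entirely contributes only a partial chunk.
    """
    budget = prefill_token_budget(num_decodes)
    chunks = []
    for request_id, remaining in prefill_queue:
        if budget == 0:
            break
        take = min(remaining, budget)
        chunks.append((request_id, take))
        budget -= take
    return chunks


if __name__ == "__main__":
    # 50 decodes leave 80 ms of headroom, i.e. 1600 prefill tokens under this model.
    print(form_batch(50, [("p1", 1000), ("p2", 1000)]))
```

Under these assumed coefficients, a heavily loaded decode batch leaves no room for prefill work at all, which is the kind of composition adjustment the local scheduler is described as making when load fluctuates.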

Abstract

LLM inference must meet strict latency SLOs (e.g., 100 ms P99 time-between-tokens) while maximizing goodput. Yet real-world variability in prompt and response lengths skews the balance between the compute-intensive prefill phase and the memory-bound decode phase, so neither colocated deployments (even with chunked prefill) nor disaggregated deployments can deliver low tail latency and high throughput at the same time. We introduce DynaServe, a high-performance LLM serving system built atop vLLM that unifies and extends both paradigms to maximize goodput under SLO constraints when handling unbalanced and dynamic workloads. It relies on a micro-request abstraction, which splits each request at an arbitrary token boundary into at most two cooperating segments. A two-level scheduling framework then balances micro-request load across unified GPU instances. The global scheduler rapidly selects per-request split points by considering both the request's prefill/decode time ratio and the current load across GPU instances. The local scheduler on each GPU instance independently forms SLO-aware batches, adjusting their composition in response to workload fluctuations, potential latency spikes, and per-GPU under- or over-utilization. On real-world traces, DynaServe increases overall serving capacity by 1.15$\times$ to 3.07$\times$, improves goodput by up to 1.91$\times$ and 1.61$\times$ over state-of-the-art colocated and disaggregated baselines, and improves performance under SLO by up to 60% in a hybrid workload.
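The micro-request abstraction and the global scheduler's split-point choice can be sketched as follows. This is only an illustration under stated assumptions: two instances named "A" and "B", a per-token cost model in which a decode token costs a fixed multiple of a prefill token, a predicted output length that is known up front, and an exhaustive search over split points. The names `MicroRequest`, `segment_cost`, and `plan_split` are hypothetical and do not come from DynaServe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroRequest:
    request_id: str
    start: int  # first token position covered (inclusive)
    end: int    # last token position covered (exclusive)


def segment_cost(start, end, prompt_len, prefill_cost, decode_cost):
    """Assumed cost of processing token positions [start, end) of one request."""
    prefill_tokens = max(0, min(end, prompt_len) - start)
    decode_tokens = max(0, end - max(start, prompt_len))
    return prefill_tokens * prefill_cost + decode_tokens * decode_cost


def plan_split(request_id, prompt_len, predicted_output_len,
               load_a, load_b, prefill_cost=1, decode_cost=3):
    """Pick the token boundary that minimizes the more loaded instance's total work.

    Returns a list of (instance, MicroRequest) with at most two entries; an
    empty segment is dropped, so the request may run entirely on one instance.
    """
    total = prompt_len + predicted_output_len

    def peak_load(split):
        a = load_a + segment_cost(0, split, prompt_len, prefill_cost, decode_cost)
        b = load_b + segment_cost(split, total, prompt_len, prefill_cost, decode_cost)
        return max(a, b)

    # Exhaustive search for clarity; ties resolve to the smallest split point.
    split = min(range(total + 1), key=peak_load)
    segments = [("A", MicroRequest(request_id, 0, split)),
                ("B", MicroRequest(request_id, split, total))]
    return [(inst, mr) for inst, mr in segments if mr.end > mr.start]


if __name__ == "__main__":
    # Idle instances: the split lands inside the decode phase, not at the
    # prefill/decode boundary, because decode tokens are assumed costlier.
    print(plan_split("r1", 100, 100, load_a=0, load_b=0))
    # Instance A already busy: the whole request goes to B as one micro-request.
    print(plan_split("r1", 100, 100, load_a=400, load_b=0))
```

A split at the prompt boundary reproduces classic prefill/decode disaggregation, while a split at either end of the request reproduces colocation, which is one way to read the claim that the abstraction unifies both paradigms.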
