DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving
Chaoyi Ruan, Yinhe Chen, Dongqi Tian, Yandong Shi, Yongji Wu, Jialin Li, Cheng Li
TL;DR
DynaServe targets low tail latency ($P99$ time-between-tokens, TBT) and high goodput for LLM serving under dynamic workload skew. It introduces Adaptive Request Partition and Scheduling (APS), a two-level framework built on micro-requests that can span arbitrary token boundaries and on unified GPU instances, so that a single system supports both coarse-grained colocation and fine-grained disaggregation. A global scheduler selects near-optimal split points based on decoded length and current load, while local schedulers perform SLO-aware batching to maximize GPU utilization without violating the $100$ ms TBT SLO; chunk-based KV transfers at runtime further reduce inter-instance overhead. On real workloads running on A100 clusters, DynaServe achieves up to $3.07\times$ the serving capacity and up to $1.91\times$ and $1.61\times$ higher goodput than colocation and disaggregation baselines, respectively, and remains robust under hybrid and real-time traffic. DynaServe thus unifies and extends the two existing paradigms for dynamic LLM serving.
Abstract
LLM inference must meet strict latency SLOs (e.g., 100 ms P99 time-between-tokens) while maximizing goodput. Yet real-world variability in prompt and response lengths skews the load between the compute-intensive prefill phase and the memory-bound decode phase, so neither colocated deployments (even with chunked prefill) nor disaggregated deployments can deliver low tail latency and high throughput at the same time. We introduce DynaServe, a high-performance LLM serving system built atop vLLM that unifies and extends both paradigms to maximize goodput under SLO constraints when handling unbalanced and dynamic workloads. It relies on a micro-request abstraction, which splits each request at an arbitrary token boundary into at most two cooperating segments. A two-level scheduling framework then balances micro-request load across unified GPU instances. The global scheduler rapidly selects per-request split points by considering both the request's prefill/decode time ratio and the current load across GPU instances. The local scheduler on each GPU instance independently forms SLO-aware batches, adjusting their composition in response to workload fluctuations, potential latency spikes, and per-GPU under- or over-utilization. On real-world traces, DynaServe boosts overall serving capacity by 1.15$\times$ to 3.07$\times$, improves goodput by up to 1.91$\times$ and 1.61$\times$ over state-of-the-art colocated and disaggregated baselines, respectively, and improves performance under SLO by up to 60\% on a hybrid workload.
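The micro-request abstraction and load-aware split-point selection can be illustrated with a small sketch. This is not DynaServe's implementation: the names `MicroRequest`, `split_request`, `segment_cost`, and `choose_split_point`, the linear per-token cost model, and the brute-force search are all assumptions made for illustration. The sketch treats a request as a token range covering the prompt plus an expected output length, splits it at one token boundary into at most two segments, and picks the boundary that minimizes the larger of the two resulting instance loads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroRequest:
    """Hypothetical segment of a request: token positions [start, end)."""
    request_id: str
    start: int
    end: int


def split_request(request_id, prompt_len, output_len, split_point):
    """Split a request at one token boundary into at most two segments."""
    total = prompt_len + output_len
    if not 0 <= split_point <= total:
        raise ValueError("split_point must lie within the request")
    segments = []
    if split_point > 0:
        segments.append(MicroRequest(request_id, 0, split_point))
    if split_point < total:
        segments.append(MicroRequest(request_id, split_point, total))
    return segments


def segment_cost(start, end, prompt_len, prefill_cost, decode_cost):
    """Assumed linear cost: prompt positions cost prefill_cost each,
    generated positions cost decode_cost each."""
    prefill_tokens = max(0, min(end, prompt_len) - min(start, prompt_len))
    decode_tokens = (end - start) - prefill_tokens
    return prefill_tokens * prefill_cost + decode_tokens * decode_cost


def choose_split_point(prompt_len, output_len, load_first, load_second,
                       prefill_cost=1.0, decode_cost=4.0):
    """Brute-force search for the boundary that minimizes the maximum
    load across the two instances (an illustrative objective only)."""
    total = prompt_len + output_len

    def worst_load(s):
        first = load_first + segment_cost(
            0, s, prompt_len, prefill_cost, decode_cost)
        second = load_second + segment_cost(
            s, total, prompt_len, prefill_cost, decode_cost)
        return max(first, second)

    return min(range(total + 1), key=worst_load)
```

In this toy model, a split at the prompt boundary corresponds to classic prefill/decode disaggregation, while a split at position 0 or at the end leaves the whole request on one instance, which corresponds to colocation. A real scheduler would use profiled latencies rather than fixed per-token costs and would have to work from a predicted output length.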
