DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving

Chaoyi Ruan, Yinhe Chen, Dongqi Tian, Yandong Shi, Yongji Wu, Jialin Li, Cheng Li

TL;DR

DynaServe targets low tail latency (P99 time-between-tokens, TBT) and high goodput for LLM serving under dynamic workload skew. It introduces Adaptive Request Partition and Scheduling (APS), a two-level framework built on micro-requests that can be split at arbitrary token boundaries and on unified GPU instances, so that a single system can behave anywhere between full colocation and full disaggregation. A global scheduler selects near-optimal split points based on decoded length and current load, while local schedulers perform SLO-aware batching to maximize GPU utilization without violating the 100 ms TBT SLO; chunk-based KV transfers at runtime further reduce inter-instance overhead. Across real workloads on A100 clusters, DynaServe achieves up to $3.07\times$ the serving capacity and up to $1.91\times$ and $1.61\times$ better goodput than colocation and disaggregation baselines, respectively, and remains robust under hybrid and real-time traffic. The approach unifies and extends existing paradigms for dynamic, disaggregated LLM serving under a latency SLO.

Abstract

LLM inference must meet strict latency SLOs (e.g., 100 ms P99 time-between-tokens) while maximizing goodput. Yet real-world variability in prompt and response lengths skews the balance between the compute-intensive prefill phase and the memory-bound decode phase, so neither colocated deployments (even with chunked prefill) nor disaggregated deployments can simultaneously deliver low tail latency and high throughput. We introduce DynaServe, a high-performance LLM serving system built atop vLLM that unifies and extends both paradigms to maximize goodput under SLO constraints when handling unbalanced and dynamic workloads. It relies on a micro-request abstraction, which splits each request at an arbitrary token boundary into at most two cooperating segments. A two-level scheduling framework then balances micro-request load across unified GPU instances. The global scheduler rapidly selects per-request split points by considering both the request's prefill/decode time ratio and the current load across GPU instances. The local scheduler on each GPU instance independently forms SLO-aware batches, adjusting their composition in response to workload fluctuations, potential latency spikes, and per-GPU under- or over-utilization. On real-world traces, DynaServe boosts the overall serving capacity by 1.15$\times$ to 3.07$\times$, improves goodput by up to 1.91$\times$ and 1.61$\times$, and improves performance by up to 60% on a hybrid workload under SLO, compared to state-of-the-art colocated and disaggregated baselines.
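
The micro-request abstraction described above can be illustrated with a minimal sketch. It assumes that a split position indexes the combined prompt-plus-output token range (consistent with the Figure 5 caption, where position 1024 on a 1024/1024 request corresponds to PD disaggregation) and that the output length is known or predicted in advance. The names `MicroRequest` and `split_request` are hypothetical and are not taken from DynaServe or vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroRequest:
    """Hypothetical record for one segment of a split request."""
    request_id: str
    start: int  # first token position covered (inclusive)
    end: int    # one past the last token position covered (exclusive)


def split_request(request_id: str, prompt_len: int, output_len: int,
                  split_pos: int) -> list[MicroRequest]:
    """Split a request at one token boundary into at most two segments.

    Positions 0..prompt_len-1 are prompt tokens (prefill work); positions
    prompt_len..prompt_len+output_len-1 are generated tokens (decode work).
    Assumption: output_len is known or estimated up front.
    """
    total = prompt_len + output_len
    if not 0 <= split_pos <= total:
        raise ValueError("split_pos must lie within the request's token range")
    segments = []
    if split_pos > 0:
        segments.append(MicroRequest(request_id, 0, split_pos))
    if split_pos < total:
        segments.append(MicroRequest(request_id, split_pos, total))
    return segments


# Splitting exactly at the prompt boundary separates prefill from decode.
disagg = split_request("r0", 1024, 1024, 1024)
# A later split moves the first 334 decode steps onto the first segment.
shifted = split_request("r0", 1024, 1024, 1358)
```

In this sketch, a split at position 0 or at the end of the request yields a single segment, which corresponds to running the whole request on one instance.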

Paper Structure

This paper contains 23 sections, 11 figures, 4 tables, 2 algorithms.

Figures (11)

  • Figure 1: Throughput vs. SLO attainment across serving architectures. PD colocation with chunked prefill reaches high throughput but violates the latency SLO. PD disaggregation satisfies the SLO but under-utilizes GPUs. DynaServe balances the two, advancing the frontier toward the top-right with higher goodput at guaranteed latency.
  • Figure 2: Partition and scheduling strategies in LLM serving. (a) PD colocation, applying chunked prefill to reduce interference. (b) coarse-grained PD disaggregation, which avoids interference but leads to GPU underutilization.
  • Figure 3: Distribution of prompt and output token lengths. The blue line shows prompt length, and the orange line shows output length. The green line indicates the balanced output length, at which decode time equals prefill time.
  • Figure 4: The overall architecture of DynaServe features unified GPU instances executing partitioned micro-requests, guided by a two-level APS mechanism for improved SLO attainment and resource utilization. Orange and blue colors denote the prefill and decode stages of each request, respectively.
  • Figure 5: Throughput of Qwen2.5-32B on A100 under different split positions. Each request has a 1024-token prompt and a 1024-token output. Position 1024 corresponds to PD disaggregation, while position 1358 represents the optimal split found by dynamic partitioning.
  • ...and 6 more figures
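
The "balanced output length" in the Figure 3 caption, where decode time equals prefill time, can be made concrete with a toy cost model. The sketch below assumes a constant per-token prefill cost and a constant per-token decode cost; this is a simplification rather than the paper's model, since real decode cost depends on context length and batch composition. The function name and the numeric costs are hypothetical.

```python
def balanced_output_len(prompt_len: int,
                        prefill_ms_per_token: float,
                        decode_ms_per_token: float) -> float:
    """Output length at which total decode time equals total prefill time.

    Assumed linear model (not from the paper):
        prefill_time = prompt_len * prefill_ms_per_token
        decode_time  = output_len * decode_ms_per_token
    Setting the two equal and solving for output_len gives the result.
    """
    if decode_ms_per_token <= 0:
        raise ValueError("decode_ms_per_token must be positive")
    return prompt_len * prefill_ms_per_token / decode_ms_per_token


# Hypothetical costs: 0.25 ms per prompt token, 32 ms per generated token.
balanced = balanced_output_len(1024, 0.25, 32.0)  # 8.0 tokens
```

Under this model, requests whose outputs are longer than the balanced length are decode-heavy and requests with shorter outputs are prefill-heavy, which is the kind of skew the abstract refers to.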