DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

Lei Gao, Chaoyi Jiang, Hossein Entezari Zarch, Daniel Wong, Murali Annavaram

TL;DR

DuetServe tackles the challenge of sustaining high throughput while meeting strict TBT SLOs in LLM serving by unifying aggregated and disaggregated execution within a single GPU. It introduces an attention-aware roofline model to forecast iteration latency, a dynamic SM-partitioning optimizer to isolate prefill and decode only when contention threatens the TBT bound $t_{\text{TBT}}$, and an interruption-free execution engine with look-ahead decode and CUDA Graph replay. The approach yields up to 1.3× total throughput gains while keeping TBT low across real-world traces and two model sizes (Qwen3-8B and Qwen3-14B). Practically, DuetServe enables higher GPU utilization without the latency penalties of full disaggregation, delivering scalable, low-latency LLM serving in dynamic workloads.

Abstract

Modern LLM serving systems must sustain high throughput while meeting strict latency service level objectives (SLOs) across two distinct inference phases: compute-intensive prefill and memory-bound decode. Existing approaches either (1) aggregate both phases on shared GPUs, where interference between prefill and decode degrades time-between-tokens (TBT); or (2) disaggregate the two phases across GPUs, which improves latency but wastes resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution through fine-grained, adaptive SM partitioning that provides phase isolation only when contention threatens latency SLOs. DuetServe integrates (1) an attention-aware roofline model to forecast iteration latency, (2) a partitioning optimizer that selects the optimal SM split to maximize throughput under TBT constraints, and (3) an interruption-free execution engine that eliminates CPU-GPU synchronization overhead. Evaluations show that, compared to state-of-the-art frameworks, DuetServe improves total throughput by up to 1.3× while maintaining low generation latency.
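To make the interplay between the latency forecast and the SM split concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain (not attention-aware) roofline estimate, compute throughput that scales linearly with the number of active SMs, and memory bandwidth that saturates once half of the SMs are active. The names `Gpu`, `roofline_latency`, and `choose_split`, as well as every numeric value, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    num_sms: int        # total streaming multiprocessors
    peak_flops: float   # FLOP/s with all SMs active
    peak_bw: float      # bytes/s with all SMs active


def roofline_latency(flops: float, bytes_moved: float, gpu: Gpu, sms: int) -> float:
    """Roofline estimate: latency is the slower of the compute and memory bounds."""
    frac = sms / gpu.num_sms
    # Assumption: compute scales linearly with the share of SMs.
    compute_t = flops / (gpu.peak_flops * frac)
    # Assumption: bandwidth saturates once half of the SMs are active.
    memory_t = bytes_moved / (gpu.peak_bw * min(1.0, 2.0 * frac))
    return max(compute_t, memory_t)


def choose_split(prefill, decode, gpu: Gpu, tbt_slo: float, step: int = 1):
    """Return (mode, prefill_sms, decode_sms).

    `prefill` and `decode` are (flops, bytes_moved) tuples for one iteration.
    Stay aggregated if the combined iteration is predicted to meet the SLO.
    Otherwise give decode the fewest SMs that meet the SLO and hand the rest
    to prefill, which maximizes prefill throughput under this simple model.
    """
    combined = roofline_latency(prefill[0] + decode[0],
                                prefill[1] + decode[1],
                                gpu, gpu.num_sms)
    if combined <= tbt_slo:
        return ("aggregated", gpu.num_sms, 0)

    for decode_sms in range(step, gpu.num_sms, step):
        if roofline_latency(decode[0], decode[1], gpu, decode_sms) <= tbt_slo:
            return ("partitioned", gpu.num_sms - decode_sms, decode_sms)

    # Assumption: with no feasible split, fall back to aggregated execution.
    return ("aggregated", gpu.num_sms, 0)


if __name__ == "__main__":
    gpu = Gpu(num_sms=100, peak_flops=1e14, peak_bw=1e12)  # hypothetical device
    prefill = (1e13, 1e10)  # compute-heavy iteration
    decode = (1e10, 1e10)   # memory-bound iteration
    print(choose_split(prefill, decode, gpu, tbt_slo=0.03))
```

The linear search is kept for readability. A real system would presumably replace the analytic scaling assumptions with profiled bandwidth and FLOPs per SM count and would account for attention cost separately, as the abstract indicates.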

Paper Structure

This paper contains 31 sections, 9 equations, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: Overview of DuetServe.
  • Figure 2: (a) Linear layer benchmarking shows A100 and H100 saturate near 2K and 8K tokens. (b) Prefill latency under an 8K budget violates TBT SLOs despite full utilization. (c) Decode latency rises with longer contexts as KV cache grows even under the same budget.
  • Figure 3: Performance comparison between PD aggregated and disaggregated systems across varying QPS.
  • Figure 4: (a) Profiled HBM bandwidth and FLOPs versus active TPCs. (b–c) Resource utilization during prefill and decode phases.
  • Figure 5: (Top) Conventional decoding incurs CPU–GPU stalls from per-step synchronization. (Bottom) Interruption-free kernel dispatching schedules multiple decoding steps in advance.
  • ...and 5 more figures