Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving

Xiaoxiang Shi, Colin Cai, Junjia Du, Zhihao Jia

TL;DR

Nexus targets latency-sensitive LLM serving by proactively disaggregating prefill and decode within a single GPU (intra-engine disaggregation). It combines a lightweight latency cost model, a greedy SM-partitioning algorithm, hysteresis to keep the partition stable, and phase-specific schedulers, so that resource allocation adapts to dynamic workloads. Across diverse models and workloads, Nexus achieves up to 2.2x higher throughput, up to 20x lower TTFT, and up to 2.5x lower TBT than vLLM, and outperforms SGLang by up to 2x. The approach delivers the benefits of disaggregation without cross-GPU transfer costs, preserving high GPU utilization and robustness under varying traffic and prompt structures.

Abstract

Monolithic serving with chunked prefill improves GPU utilization by batching prefill and decode together, but suffers from fine-grained phase interference. Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead. Prior intra-GPU disaggregation approaches multiplex prefill and decode within a single GPU, using SLO-based tuning guided by heuristics from offline profiling or reactive feedback loops. However, these methods respond reactively to performance issues rather than anticipating them, limiting adaptability under dynamic workloads. We ask: can we achieve proactive intra-GPU disaggregation that adapts effectively to dynamic workloads? The key challenge lies in managing the conflicting resource demands of prefill and decode under varying conditions. We first show that GPU resources exhibit diminishing returns -- beyond a saturation point, more allocation yields minimal latency benefit. Second, we observe that memory bandwidth contention becomes a critical bottleneck. These insights motivate a design that dynamically partitions GPU resources across prefill and decode phases, while jointly considering compute capacity, memory footprint, and bandwidth contention. Evaluated on diverse LLMs and workloads, our system Nexus achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM; outperforms SGLang by up to 2x; and matches or exceeds disaggregated vLLM.
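To make the diminishing-returns observation concrete, the following is a minimal sketch of how a greedy SM split with hysteresis could look. It is not the cost model or algorithm from the paper: the `Phase` class, the `work` and `floor_ms` parameters, the step size, and the threshold are all hypothetical, and the toy latency curve simply flattens once a phase reaches its saturation point.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """Toy latency model for one phase (hypothetical, not the paper's model)."""
    work: float      # assumed compute work, in SM-milliseconds
    floor_ms: float  # assumed latency floor, e.g. a memory-bound limit

    def latency_ms(self, sms: int) -> float:
        # Latency shrinks as work / sms until the floor is hit, then flattens.
        return max(self.work / sms, self.floor_ms)


def greedy_split(prefill: Phase, decode: Phase, total_sms: int,
                 step: int = 2, min_sms: int = 2) -> tuple[int, int]:
    """Hand out SMs in small steps to whichever phase gains more latency."""
    p = d = min_sms
    while p + d + step <= total_sms:
        gain_p = prefill.latency_ms(p) - prefill.latency_ms(p + step)
        gain_d = decode.latency_ms(d) - decode.latency_ms(d + step)
        if gain_p <= 0 and gain_d <= 0:
            break  # both phases are saturated; extra SMs would not help
        if gain_p >= gain_d:
            p += step
        else:
            d += step
    return p, d


def apply_hysteresis(current: tuple[int, int], proposed: tuple[int, int],
                     threshold: int = 4) -> tuple[int, int]:
    """Keep the current split unless the proposal moves at least `threshold` SMs."""
    if abs(proposed[0] - current[0]) < threshold:
        return current
    return proposed


if __name__ == "__main__":
    # Assumed numbers: a 108-SM GPU, a compute-heavy prefill, a decode that
    # stops improving at 30 SMs.
    prefill = Phase(work=4000.0, floor_ms=40.0)
    decode = Phase(work=300.0, floor_ms=10.0)
    split = greedy_split(prefill, decode, total_sms=108)
    print("prefill SMs, decode SMs:", split)
```

In this toy setup, decode never receives SMs beyond its saturation point because its marginal gain drops to zero there, so the remaining SMs flow to prefill. The hysteresis helper only illustrates the idea of suppressing small, frequent repartitions; the threshold value is arbitrary.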

Paper Structure

This paper contains 33 sections, 12 equations, 13 figures, 1 table, 2 algorithms.

Figures (13)

  • Figure 1: Design evolution of LLM inference systems. Comparison between monolithic, disaggregated, and intra-engine disaggregated designs. Ap is the prefill phase of request A; Bd, Cd, and Dd are the decode phases of requests B, C, and D.
  • Figure 2: Inference process of transformer-based LLMs. Red boxes indicate compute-bound operations (KQV Linear, Prefill Attention, Attention Linear, and FFN Layer), while the orange box (Attention) represents a memory-bound operation. Auxiliary components such as LayerNorm are omitted for clarity.
  • Figure 3: Simplified GPU execution model. Modern GPUs share a global kernel queue, with SMs (streaming multiprocessors) dynamically fetching kernels to execute. Concurrently executing kernels compete for shared memory bandwidth.
  • Figure 4: Latency impact of mixed prefill–decode batches. (a) Prefill-only and decode-only batches have predictable latency, but mixed batches cause 8×–10× slowdown due to interference. (b) Kernel-level profiling reveals that even lightweight decode kernels experience inflated runtimes when co-executed with prefill. This highlights the fine-grained contention caused by chunked batching.
  • Figure 5: Diminishing returns in prefill and decode with increasing SM allocation. (a) End-to-end latency for prefill and decode flattens well before full SM usage. (b) Prefill kernels (e.g., FFN, KQV, attention linear) show varied sensitivity to SM scaling, with FFN benefiting the most. (c) Decode kernels saturate quickly, confirming that decode is memory-bound and gains little from additional compute.
  • ...and 8 more figures