Optimizing SLO-oriented LLM Serving with PD-Multiplexing

Weihao Cui, Yukang Chen, Han Zhao, Ziyi Xu, Quan Chen, Xusheng Chen, Yangjie Zhou, Shixuan Sun, Minyi Guo

TL;DR

Drift presents PD multiplexing to resolve the long-standing trade-off between SLO guarantees and high throughput in LLM serving. By enabling in-place compute partitioning with phase decoupling on shared GPUs, Drift preserves KV-cache locality while coordinating the prefill and decode phases through adaptive gang scheduling and contention-free modeling. The offline-online framework, combined with an SLO-aware dispatcher, yields substantial throughput gains (average 5.1x, up to 17.5x) across real-world and synthetic workloads while consistently meeting SLO targets. This approach avoids heavy KV-cache transfers and brittle chunking, making it suited to complex, multi-turn LLM services.

Abstract

Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases (prefill and decode) and across complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partition enables per-phase SLO attainment, while in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partition suffers from low utilization and high overhead due to its phase-coupling design. We present Drift, a new LLM serving framework that resolves this tension via PD multiplexing, enabling in-place and phase-decoupled compute partition. Drift leverages low-level GPU partitioning techniques to multiplex prefill and decode phases spatially and adaptively on shared GPUs, while preserving in-place memory sharing. To fully leverage the multiplexing capability, Drift introduces an adaptive gang scheduling mechanism, a contention-free modeling method, and an SLO-aware dispatching policy. Evaluation shows that Drift achieves an average $5.1\times$ throughput improvement (up to $17.5\times$) over state-of-the-art baselines, while consistently meeting SLO targets under complex LLM workloads.

Paper Structure

This paper contains 52 sections, 1 equation, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: Examples of processing multi-turn conversations in modern LLMs with out-of-place partition and in-place sharing.
  • Figure 2: Main architecture of most LLMs.
  • Figure 3: Required compute resources and accessed memory for processing different inference phases under SLO constraints with varying reused context lengths. In the prefill phase, the batch size is fixed at 1, the new context length is set to 2K tokens, and TTFT is set to 400 ms. In the decode phase, the batch size is fixed at 32, and TBT is set to 80 ms. These values are commonly seen in online serving.
  • Figure 4: Execution timeline comparison between chunking-based solution and Drift.
  • Figure 5: Sweet spot of the token budget in chunk-based solutions. The decode phase uses a fixed batch size of 32, with each request having a reused context length of 1K tokens.
  • ...and 8 more figures
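
The per-phase SLO settings quoted in the Figure 3 caption (a TTFT of 400 ms for prefill and a TBT of 80 ms for decode) can be made concrete with a small attainment check. The sketch below is illustrative only and is not taken from Drift: the `RequestTrace` record, the helper names, and the choice to judge decode by the worst inter-token gap (rather than, say, a percentile) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

# SLO values taken from the Figure 3 caption.
TTFT_SLO_MS = 400.0  # time to first token (prefill phase)
TBT_SLO_MS = 80.0    # time between tokens (decode phase)


@dataclass
class RequestTrace:
    """Hypothetical per-request record; all timestamps are in milliseconds."""
    arrival_ms: float
    token_times_ms: List[float]  # emission time of each output token, ascending


def ttft_ms(trace: RequestTrace) -> float:
    """Latency from request arrival to the first generated token."""
    return trace.token_times_ms[0] - trace.arrival_ms


def max_tbt_ms(trace: RequestTrace) -> float:
    """Largest gap between consecutive tokens (0.0 if only one token)."""
    times = trace.token_times_ms
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return max(gaps, default=0.0)


def meets_slo(trace: RequestTrace) -> bool:
    """Assumption: a request attains its SLO only if both phases do."""
    return ttft_ms(trace) <= TTFT_SLO_MS and max_tbt_ms(trace) <= TBT_SLO_MS


def slo_attainment(traces: List[RequestTrace]) -> float:
    """Fraction of requests that meet both the TTFT and the TBT target."""
    if not traces:
        return 0.0
    return sum(meets_slo(t) for t in traces) / len(traces)


# Made-up traces for illustration, not measurements from the paper.
example_traces = [
    RequestTrace(arrival_ms=0.0, token_times_ms=[350.0, 420.0, 495.0]),  # meets both
    RequestTrace(arrival_ms=0.0, token_times_ms=[450.0, 500.0]),         # TTFT too high
    RequestTrace(arrival_ms=100.0, token_times_ms=[400.0, 490.0]),       # TBT too high
]
```

Separating the two checks mirrors the point made in the abstract: prefill and decode have distinct latency targets, so a serving system can satisfy one while violating the other.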