DOPD: A Dynamic PD-Disaggregation Architecture for Maximizing Goodput in LLM Inference Serving

Junhan Liao, Minxian Xu, Wanyi Zheng, Yan Wang, Kejiang Ye, Rajkumar Buyya, Chengzhong Xu

TL;DR

This work addresses inefficiencies in PD-Disaggregation for LLM inference caused by heterogeneous and time-varying workloads. It introduces DOPD, a framework that (i) analytically derives an optimal P/D ratio based on workload forecasts and device constraints, (ii) uses length-aware request scheduling to mitigate mixed-length interference, and (iii) dynamically resizes P- and D-instances to maintain producer–consumer balance. Through extensive experiments on real production traces and multiple LLMs, DOPD achieves up to 1.5x goodput, up to 67.5% faster P90 TTFT, and near-perfect SLO attainment, outperforming static and dynamic baselines. The approach provides a scalable, low-overhead mechanism for deploying disaggregated LLM inference in industrial environments while conserving GPU resources and meeting stringent SLAs.

Abstract

To meet strict Service-Level Objectives (SLOs), contemporary Large Language Models (LLMs) decouple the prefill and decoding stages and place them on separate GPUs to mitigate the distinct bottlenecks inherent to each phase. However, the heterogeneity of LLM workloads causes a producer-consumer imbalance between the two instance types in such a disaggregated architecture. To address this problem, we propose DOPD (Dynamic Optimal Prefill/Decoding), a dynamic LLM inference system that adjusts instance allocations to achieve an optimal prefill-to-decoding (P/D) ratio based on real-time load monitoring. Combined with an appropriate request-scheduling policy, DOPD effectively resolves imbalances between prefill and decoding instances and mitigates resource allocation mismatches due to mixed-length requests under high concurrency. Experimental evaluations show that, compared with vLLM and DistServe (representative aggregation-based and disaggregation-based approaches, respectively), DOPD improves overall system goodput by up to 1.5x, decreases P90 time-to-first-token (TTFT) by up to 67.5%, and decreases P90 time-per-output-token (TPOT) by up to 22.8%. Furthermore, our dynamic P/D adjustment technique performs proactive reconfiguration based on historical load, achieving over 99% SLO attainment while using fewer additional resources.
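
The producer-consumer balance described in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not DOPD's analytical derivation, which is not reproduced on this page. It assumes a simplified model in which each P-instance sustains a fixed prefill token throughput and each D-instance a fixed decode token throughput. The function name and all parameters are hypothetical.

```python
import math


def balanced_instances(arrival_rate, mean_input_len, mean_output_len,
                       prefill_tokens_per_s, decode_tokens_per_s):
    """Estimate P- and D-instance counts under a simplified throughput model.

    Assumption (not from the paper): an instance's request rate is its
    token throughput divided by the mean tokens it must process per request.
    """
    # Requests per second one P-instance can prefill (the "producer" rate).
    prefill_req_rate = prefill_tokens_per_s / mean_input_len
    # Requests per second one D-instance can fully decode (the "consumer" rate).
    decode_req_rate = decode_tokens_per_s / mean_output_len

    # Enough instances of each type to absorb the arrival rate.
    n_prefill = max(1, math.ceil(arrival_rate / prefill_req_rate))
    n_decode = max(1, math.ceil(arrival_rate / decode_req_rate))

    # Ideal (fractional) P/D ratio at which producer and consumer rates match:
    # n_p * prefill_req_rate == n_d * decode_req_rate.
    ideal_pd_ratio = decode_req_rate / prefill_req_rate
    return n_prefill, n_decode, ideal_pd_ratio


# Hypothetical workload: 10 req/s, 1000 input tokens, 200 output tokens.
n_p, n_d, ratio = balanced_instances(10, 1000, 200, 20000, 1000)
print(n_p, n_d, ratio)
```

Under this model, longer inputs push the ratio toward more P-instances and longer outputs toward more D-instances, which is why a ratio fixed at deployment time can drift out of balance as the workload mix changes.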

Paper Structure

This paper contains 36 sections, 20 equations, 10 figures, 2 tables, 2 algorithms.

Figures (10)

  • Figure 1: Prefill-decoding disaggregation diagram.
  • Figure 2: Input and output token lengths of requests in the Microsoft Azure traces dataset.
  • Figure 3: Experiments with different P/D ratios. The label "2P (TP=1) / 2D (TP=2)" in the figure denotes a PD-Disaggregation deployment comprising 2 P-instances, each deployed on a single GPU (tensor parallel size $=1$), and 2 D-instances, each deployed across two GPUs (tensor parallel size $=2$). Some curves are incomplete because running further experiments would be meaningless. For example, the configuration corresponding to the red curve cannot cope with a load exceeding a request concurrency of 50.
  • Figure 4: Comparison of prefill request scheduling solutions. Every block represents the prefill inference process for a sequence: a blue block is prefill inference performed in a P-instance, and a yellow block is prefill inference performed in a D-instance. A solid block is being processed at the current step, and a dashed block is still pending.
  • Figure 5: DOPD system architecture.
  • ...and 5 more figures
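
The deployment labels used in the Figure 3 caption encode how many GPUs each side of the deployment consumes: instance count multiplied by tensor parallel size. As a small illustration, the sketch below parses such a label and tallies GPUs per instance type. The helper name and the regular expression are assumptions made for this example and are not part of DOPD.

```python
import re


def gpus_for_deployment(label):
    """Return GPUs used by P- and D-instances for a label such as
    "2P (TP=1) / 2D (TP=2)" (hypothetical helper, illustration only)."""
    # Matches "<count><P|D> (TP=<size>)", tolerating optional spaces.
    pattern = r"(\d+)\s*([PD])\s*\(TP\s*=\s*(\d+)\)"
    gpus = {"P": 0, "D": 0}
    for count, kind, tp_size in re.findall(pattern, label):
        # Each instance spans tp_size GPUs.
        gpus[kind] += int(count) * int(tp_size)
    return gpus


usage = gpus_for_deployment("2P (TP=1) / 2D (TP=2)")
print(usage, "total:", sum(usage.values()))
```

For the label quoted in the caption, the P side uses 2 GPUs and the D side uses 4, for 6 GPUs in total.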