
TridentServe: A Stage-level Serving System for Diffusion Pipelines

Yifei Xia, Fangcheng Fu, Hao Yuan, Hanke Zhang, Xupeng Miao, Yijun Liu, Suhan Ling, Jie Jiang, Bin Cui

TL;DR

TridentServe addresses inefficiencies in diffusion-pipeline serving with a dynamic, stage-level resource-allocation framework. It jointly optimizes model placement and per-request dispatch plans via a Dynamic Orchestrator and a Resource-Aware Dispatcher, guided by offline profiling and online monitoring, with an Adjust-on-Dispatch mechanism for seamless deployment. Across diverse workloads and pipelines, the system improves SLO attainment by up to 2.5× and reduces average/P95 latency by up to 3.6×/4.1×, while avoiding out-of-memory (OOM) errors and adapting to dynamic workload patterns. This stage-aware approach makes serving complex diffusion workloads more efficient and robust, which matters for scalable, low-latency generative vision services.

Abstract

Diffusion pipelines, renowned for their powerful visual generation capabilities, have seen widespread adoption in generative vision tasks (e.g., text-to-image/video). These pipelines typically follow a three-stage encode-diffuse-decode architecture. Current serving systems deploy diffusion pipelines within a static, manual, and pipeline-level paradigm, allocating the same resources to every request and stage. However, through an in-depth analysis, we find that such a paradigm is inefficient because resource needs differ across the three stages of each request, as well as across different requests. Following the analysis, we propose the dynamic stage-level serving paradigm and develop TridentServe, a new diffusion serving system. TridentServe automatically and dynamically derives the placement plan (i.e., how each stage resides) for pipeline deployment and the dispatch plan (i.e., how the requests are routed) for request processing, co-optimizing the resource allocation for both model and requests. Extensive experiments show that TridentServe consistently improves SLO attainment by up to 2.5x and reduces average/P95 latencies by up to 3.6x/4.1x over existing works across a variety of workloads.
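
To make the pipeline-level versus stage-level contrast concrete, the following toy accounting uses the per-request GPU demands quoted in the Figure 2 caption (Diffuse demands of 3, 2, and 1 GPUs; 1 GPU for every Encode and Decode stage). It is only an illustrative sketch, not TridentServe's placement or dispatch algorithm: the `Request` class, both helper functions, the choice to size the static pipeline by the largest Diffuse demand, and the "one slot per GPU per stage" unit are all assumptions introduced here, and stage durations are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical per-stage GPU demand of one request."""
    encode: int
    diffuse: int
    decode: int


def pipeline_level_slots(gpus_per_pipeline: int) -> int:
    # Assumption: a static pipeline-level deployment holds the same
    # GPU group for all three stages of a request.
    return 3 * gpus_per_pipeline


def stage_level_slots(req: Request) -> int:
    # Assumption: stage-level serving gives each stage only the GPUs it needs.
    return req.encode + req.diffuse + req.decode


# Demands taken from the Figure 2 caption: Diffuse = 3, 2, 1; Encode = Decode = 1.
requests = [Request(1, 3, 1), Request(1, 2, 1), Request(1, 1, 1)]

# Assumption: the static pipeline is sized for the most demanding Diffuse stage.
static_gpus = max(r.diffuse for r in requests)

pipeline_total = sum(pipeline_level_slots(static_gpus) for _ in requests)
stage_total = sum(stage_level_slots(r) for r in requests)

print(f"pipeline-level GPU-stage slots: {pipeline_total}")
print(f"stage-level GPU-stage slots:    {stage_total}")
```

Under these assumptions the static deployment reserves more than twice as many GPU-stage slots as the per-stage demands require, which is the kind of discrepancy the abstract attributes to the pipeline-level paradigm. Real savings depend on stage durations and parallelism efficiency, which this sketch does not model.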

Paper Structure

This paper contains 42 sections, 1 theorem, 13 equations, 17 figures, 7 tables, 2 algorithms.

Key Result

Proposition 1

SADP-Deadline is NP-complete. The hardness holds even under the restricted setting in which (i) $\mathcal{W}_E,\mathcal{W}_D,\mathcal{W}_C$ each contain only single-GPU teams, (ii) $Q_{r,ED}=Q_{r,DC}=0$ for all $r$, and (iii) placements $\pi_g$ forbid co-location of different stages on the same GPU.

Figures (17)

  • Figure 1: The typical inference process of a Diffusion Pipeline.
  • Figure 2: Example of different serving methods. Assuming the GPU demands for the Diffuse stage of requests #1/2/3 are 3,2,1 respectively, and the demand for all Encode, Decode stages is 1.
  • Figure 3: Parallelism effects on Diffuse and Decode stages of Flux.1, tested on NVIDIA L20.
  • Figure 4: Model replica demands to achieve a balanced processing speed, tested on NVIDIA L20. Light/Medium/Heavy denote workloads defined in the experiment setup section.
  • Figure 5: Overview of TridentServe.
  • ...and 12 more figures

Theorems & Definitions (1)

  • Proposition 1: NP-completeness