GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

Fanjiang Ye, Zhangke Li, Xinrui Zhong, Ethan Ma, Russell Chen, Kaijian Wang, Jingwei Zuo, Desen Sun, Ye Cao, Triston Cao, Myungjin Lee, Arvind Krishnamurthy, Yuke Wang

Abstract

Diffusion models have emerged as the prevailing approach for text-to-image (T2I) and text-to-video (T2V) generation, yet production platforms must increasingly serve both modalities on shared GPU clusters while meeting stringent latency SLOs. Co-serving such heterogeneous workloads is challenging: T2I and T2V requests exhibit vastly different compute demands, parallelism characteristics, and latency requirements, leading to significant SLO violations in existing serving systems. We present GENSERVE, a co-serving system that leverages the inherent predictability of the diffusion process to optimize serving efficiency. A central insight is that diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries, opening a new design space for heterogeneity-aware resource management. GENSERVE introduces step-level resource adaptation through three coordinated mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes resource allocation across all concurrent requests. Experimental results show that GENSERVE improves the SLO attainment rate by up to 44% over the strongest baseline across diverse configurations.
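The central insight above is that diffusion inference is naturally preemptible at step boundaries. As an illustration only (this is a minimal sketch with invented names, not GenServe's actual implementation), a denoising loop can checkpoint its resume point between steps so a scheduler may pause a long video request and later resume it without redoing work:

```python
# Hypothetical sketch: a denoising loop that is preemptible at step
# boundaries. Because each step's cost is predictable, a scheduler can
# decide between steps whether to pause a long video request and reclaim
# its GPU for latency-sensitive image requests.

from dataclasses import dataclass, field

@dataclass
class DiffusionRequest:
    req_id: str
    total_steps: int
    next_step: int = 0                            # resume point if preempted
    latents: list = field(default_factory=list)   # stand-in for tensor state

def run_steps(req: DiffusionRequest, should_preempt) -> bool:
    """Run denoising steps; return True if finished, False if preempted."""
    while req.next_step < req.total_steps:
        # One denoising step (placeholder for a DiT forward pass).
        req.latents.append(f"step-{req.next_step}")
        req.next_step += 1
        # Step boundary: a safe point to checkpoint state and yield the GPU.
        if req.next_step < req.total_steps and should_preempt(req):
            return False
    return True

# Usage: preempt a video request after 3 of 10 steps, then resume it.
video = DiffusionRequest("V3", total_steps=10)
assert not run_steps(video, lambda r: r.next_step == 3)  # preempted at step 3
assert run_steps(video, lambda r: False)                 # resumes, completes
assert video.next_step == 10
```

Because the checkpoint is just the request's latent state and step counter, preemption at a step boundary loses no work, in contrast to killing and restarting an in-flight request.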

Paper Structure

This paper contains 20 sections, 8 equations, 15 figures, 8 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Serving 4 videos (V1--V4) and 3 images (I1--I3) on 4 GPUs. (a) FIFO: videos occupy all GPUs until completion; images must wait, causing HOL blocking and 5 deadline misses (SLO 2/7). (b) GenServe: when I1, I2 arrive, the scheduler preempts V3 and V4 at step boundaries, batches I1+I2 on GPU 3, and serves I3 on GPU 2. After images complete, V3 and V4 resume with SP degree switching (each scaling to 2 GPUs) to recover lost slack. All 7 requests meet their deadlines (SLO 7/7).
  • Figure 2: Overview of DiT inference process.
  • Figure 3: End-to-end latency of T2I and T2V workloads across batch sizes and resolutions. In T2I, batching yields noticeable savings over theoretical sequential execution at low resolutions, while T2V exhibits limited room for batching-based latency reduction even for low-resolution requests.
  • Figure 4: Head-of-line blocking under FCFS scheduling with workloads (70% video, 81 frames). Pois. denotes a Poisson arrival pattern. (a) SLO satisfaction drops sharply under bursty video arrivals: image SLO attainment falls from 62% to 12% as videos monopolize all GPUs. (b) Image P99 queue wait time increases by $5\times$, from 7.3 s to 37.4 s, confirming that FCFS cannot protect image requests from HOL blocking.
  • Figure 5: DiT and VAE Decode latency of T2V across resolutions and Sequence Parallelism (SP) degrees. DiT benefits from higher SP at high resolutions (up to 7.0$\times$ at 720p/81f) but shows diminishing returns at low resolutions. VAE Decode latency is unaffected by SP degree.
  • ...and 10 more figures
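Figure 1's SP degree switching and Figure 5's sub-linear SP speedups suggest the slack reasoning a step-level scheduler can perform. The sketch below is illustrative only (invented names and made-up speedup numbers, not GenServe's actual policy): given a predictable per-step latency, it picks the smallest SP degree whose predicted finish time still meets a request's deadline.

```python
# Hypothetical sketch of SLO-aware SP degree selection. Since per-step DiT
# latency is predictable, the scheduler can estimate whether a (possibly
# preempted) video still meets its deadline at each SP degree, and raise
# the degree only as much as needed to recover lost slack.

def finish_time(now, remaining_steps, step_latency_1gpu, sp_degree, sp_speedup):
    """Predicted completion time at a given SP degree.

    sp_speedup maps SP degree -> measured speedup over 1 GPU
    (sub-linear in practice, e.g. {1: 1.0, 2: 1.8, 4: 3.2}).
    """
    return now + remaining_steps * step_latency_1gpu / sp_speedup[sp_degree]

def min_sp_degree_meeting_slo(now, deadline, remaining_steps,
                              step_latency_1gpu, sp_speedup):
    """Smallest SP degree whose predicted finish meets the deadline, or None."""
    for degree in sorted(sp_speedup):
        if finish_time(now, remaining_steps, step_latency_1gpu,
                       degree, sp_speedup) <= deadline:
            return degree
    return None

# Usage: a video with 30 steps left at 1.0 s/step on one GPU and 20 s of
# slack cannot finish on 1 GPU (30 s), but can at SP degree 2 (~16.7 s).
speedups = {1: 1.0, 2: 1.8, 4: 3.2}
assert min_sp_degree_meeting_slo(0.0, 20.0, 30, 1.0, speedups) == 2
```

Choosing the minimum sufficient degree, rather than the maximum available, leaves GPUs free for other concurrent requests, which is the joint optimization the abstract's SLO-aware scheduler refers to.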