Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling

Euisoo Jung; Byunghyun Kim; Hyunjin Kim; Seonghye Cho; Jae-Gil Lee

TL;DR

We propose a hybrid parallelism framework for conditional diffusion models that combines condition-based partitioning, a novel data-parallel strategy, with adaptive parallelism switching, an optimal pipeline scheduling method, to reduce generation latency while maintaining high generation quality.

Abstract

Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Current diffusion acceleration methods based on distributed parallelism, however, suffer from noticeable generation artifacts and fail to achieve acceleration proportional to the number of GPUs. We therefore propose a hybrid parallelism framework that combines a novel data-parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves $2.31\times$ and $2.07\times$ latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX 3090 GPUs, while preserving image quality. These results confirm the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at https://github.com/kaist-dmlab/Hybridiff.
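Key idea (i) can be illustrated with a minimal sketch. It assumes the two denoising paths are the conditional and unconditional branches of classifier-free guidance, combined as $\epsilon_u + w(\epsilon_c - \epsilon_u)$; the abstract does not spell out the combination rule, so this is an assumption. The names `toy_denoiser`, `cfg_combine`, and `guided_step_parallel` are hypothetical, and two worker threads stand in for two GPUs. It is not the released Hybridiff implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_denoiser(x, cond):
    """Hypothetical stand-in for a noise predictor eps(x, cond)."""
    shift = 0.0 if cond is None else 0.5
    return [0.1 * v + shift for v in x]


def cfg_combine(eps_u, eps_c, scale):
    """Classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_u, eps_c)]


def guided_step_parallel(x, cond, scale, pool):
    # Each branch goes to its own worker (a stand-in for one GPU),
    # so the partition is over conditions rather than image patches.
    fut_c = pool.submit(toy_denoiser, x, cond)
    fut_u = pool.submit(toy_denoiser, x, None)
    return cfg_combine(fut_u.result(), fut_c.result(), scale)


def guided_step_serial(x, cond, scale):
    # Reference path: both branches evaluated one after the other.
    return cfg_combine(toy_denoiser(x, None), toy_denoiser(x, cond), scale)


x = [1.0, -2.0, 0.5]
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel_out = guided_step_parallel(x, "a photo of a cat", 7.5, pool)
serial_out = guided_step_serial(x, "a photo of a cat", 7.5)
```

Because each branch sees the full latent, splitting along the condition axis needs no patch stitching; the only exchange in this sketch is the pair of branch outputs before they are combined.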

Paper Structure (25 sections, 18 equations, 10 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Method
Overview
Hybrid Parallel Inference Framework
Adaptive Switching via Denoising Discrepancy
Theoretical Analysis of Adaptive Switching
Extensibility to Many GPU Configurations
Experiments
Experimental Setup
Main Results
Ablation Study
Sensitivity Analysis
Conclusion
...and 10 more sections

Figures (10)

Figure 1: Comparison of parallel strategies for diffusion inference. (a) Patch-based data parallel frameworks suffer from bottlenecks caused by all-gather operations and artifacts at patch boundaries, leading to limited acceleration and quality degradation. (b) Pipeline parallel frameworks incur excessive asynchronous communication overhead and accumulate estimation errors. (c) Our hybrid parallelism, which incorporates condition-based data parallelism, adaptively combines both paradigms to achieve high-fidelity, fast generation.
Figure 2: Overview of the proposed hybrid parallel framework for diffusion inference. Our method adaptively switches parallelism modes at $\tau_1$ and $\tau_2$, optimizing the trade-off between computational efficiency and consistency of conditional guidance, and accelerates inference while preserving high generation quality.
Figure 3: Illustration of the $\text{rel-MAE}_t(\epsilon_c, \epsilon_u)$ curve. The $\text{rel-MAE}_t(\epsilon_c, \epsilon_u)$ value is relatively large before $\tau_1$ and after $\tau_2$, while it converges near zero between them, indicating stable alignment between conditional and unconditional branches during the parallelism phase.
Figure 4: Qualitative results of the main experiments. We compare 1024$\times$1024 image generations from the SDXL model. Our method achieves the best acceleration and FID performance, while producing visuals most similar to the original.
Figure 5: Visualization of the speed–quality trade-off across different parallelism intervals $k$. Smaller $k$ values preserve higher fidelity, whereas larger $k$ values achieve greater acceleration. Our method consistently dominates prior works across the trade-off frontier. All experiments were conducted on 2 GPUs.
...and 5 more figures
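
The switching behavior described in the captions of Figures 2 and 3 can be sketched as follows. The exact definition of $\text{rel-MAE}_t$ is not given on this page, so the code assumes it is the mean absolute difference between $\epsilon_c$ and $\epsilon_u$ divided by the mean absolute value of $\epsilon_u$. The threshold rule for locating $\tau_1$ and $\tau_2$, the curve values, and the function names are likewise illustrative assumptions, not the paper's procedure.

```python
def rel_mae(eps_c, eps_u, tiny=1e-12):
    """Assumed definition: mean |eps_c - eps_u| divided by mean |eps_u|."""
    num = sum(abs(c - u) for c, u in zip(eps_c, eps_u)) / len(eps_c)
    den = sum(abs(u) for u in eps_u) / len(eps_u)
    return num / (den + tiny)


def low_discrepancy_window(curve, threshold):
    """Return (tau_1, tau_2), the first and last step below threshold.

    Hypothetical rule: steps between the two indices are treated as one
    window, even if the curve briefly rises above the threshold inside it.
    Returns None when no step falls below the threshold.
    """
    below = [t for t, v in enumerate(curve) if v < threshold]
    if not below:
        return None
    return below[0], below[-1]


# Made-up curve with the shape described for Figure 3:
# large at both ends, near zero in the middle.
curve = [0.9, 0.6, 0.08, 0.03, 0.02, 0.04, 0.3, 0.7]
window = low_discrepancy_window(curve, threshold=0.1)
example_value = rel_mae([1.0, 2.0], [1.0, 1.0])
```

In this toy curve the window covers the steps where the two branches agree closely, which is where the captions place the pipeline-parallel phase; outside it, the branches diverge and the discrepancy is large.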
