AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang

TL;DR

Diffusion models suffer from high latency because denoising proceeds through many sequential steps. AsyncDiff partitions the denoising network across multiple devices and makes denoising asynchronous by reusing hidden-state information from nearby steps, augmented with stride denoising. It substantially reduces latency on image and video diffusion models (for example, a 2.7x to 4.0x speedup on Stable Diffusion v2.1 with four NVIDIA A5000 GPUs) with minimal degradation in generative quality, and is demonstrated on models such as SD 2.1, SDXL, SVD, and AnimateDiff. The approach provides a practical, plug-and-play path to scalable diffusion inference on multi-GPU systems.

Abstract

Diffusion models have attracted significant interest for their strong generative ability across a wide range of applications. However, their multi-step sequential denoising incurs high cumulative latency and leaves little room for parallel computation. To address this, we introduce AsyncDiff, a universal, plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components and assigns each to a different device. To break the dependency chain between these components, it transforms conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component can compute in parallel on a separate device. The proposed strategy significantly reduces inference latency while minimally impacting generative quality. Specifically, for Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performance. The code is available at https://github.com/czg1225/AsyncDiff.
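
The dependency-breaking idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `sequential_step`, `async_step`, and `denoise`, the scalar "hidden states", and the update rule are all assumptions made for illustration, and the schedule (component n reads what component n-1 produced one step earlier) is one plausible reading of "exploiting the high similarity between hidden states in consecutive diffusion steps".

```python
from typing import Callable, List, Optional

# A "component" maps (hidden state, time step) to a new hidden state.
# Real components are slices of a denoising network; scalars keep the toy small.
Component = Callable[[float, int], float]


def sequential_step(components: List[Component], x: float, t: int) -> List[float]:
    """Conventional pass: component n must wait for component n-1 at the same step."""
    hidden, h = [], x
    for f in components:
        h = f(h, t)
        hidden.append(h)
    return hidden  # hidden[-1] plays the role of the noise prediction


def async_step(components: List[Component], x: float, t: int,
               prev_hidden: List[float]) -> List[float]:
    """Asynchronous pass (assumed schedule): component n reads the output that
    component n-1 produced at the previous step, so every input is known up front."""
    inputs = [x] + prev_hidden[:-1]
    # No call in this list depends on another call in the list, so each one
    # could be dispatched to its own device.
    return [f(inp, t) for f, inp in zip(components, inputs)]


def denoise(components: List[Component], x: float, num_steps: int,
            warmup: int, asynchronous: bool) -> float:
    """Run warm-up steps sequentially, then switch to asynchronous steps."""
    hidden: Optional[List[float]] = None
    for i in range(num_steps):
        t = num_steps - i
        if asynchronous and hidden is not None and i >= warmup:
            hidden = async_step(components, x, t, hidden)
        else:
            hidden = sequential_step(components, x, t)
        x = x - 0.1 * hidden[-1]  # toy update rule, not a real scheduler
    return x


# Hypothetical three-component "model" used only to show the data flow.
toy = [lambda h, t: h + 1, lambda h, t: h * 2, lambda h, t: h - 3]
print(sequential_step(toy, 5, 0))           # [6, 12, 9]
print(async_step(toy, 5, 0, [10, 20, 30]))  # [6, 20, 17]
```

In the sequential pass each call consumes the result of the call before it, whereas in the asynchronous pass all three inputs (`5`, `10`, `20`) are available before any component runs. The approximation is only as good as the similarity between consecutive hidden states, which is why the sketch keeps a sequential warm-up phase.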

Paper Structure (17 sections, 8 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: We introduce a new distributed acceleration paradigm that attains a 2.8x speed-up on Stable Diffusion XL while maintaining pixel-level consistency, using four NVIDIA A5000 GPUs.
  • Figure 2: By preparing each component's input beforehand, we enable parallel computation of the denoising model, which substantially reduces latency while minimally affecting quality.
  • Figure 3: Overview of the asynchronous denoising process. The denoising model $\epsilon_\theta$ is divided into four components $\{\epsilon_\theta^n\}_{n=1}^{4}$ for clarity. Following the warm-up stage, each component's input is prepared in advance, breaking the dependency chain and facilitating parallel processing.
  • Figure 4: Illustration of stride denoising. The model $\epsilon_\theta$ is divided into three components $\{\epsilon_\theta^n\}_{n=1}^{3}$, with a stride $S$ of 2 for clarity. Components $\epsilon_\theta^1$ and $\epsilon_\theta^2$ are skipped at time step $t$. A single parallel batch results in the completion of denoising for two steps, producing $x_{t-1}$ and $x_{t-2}$.
  • Figure 5: Qualitative Results. (a) Our method significantly accelerates the denoising process with minimal impact on generative quality. (b) Increasing warm-up steps achieves pixel-level consistency with the original output while maintaining a high speed-up ratio.
  • ...and 7 more figures
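
The caption of Figure 4 states that, with a stride of 2, a single parallel batch completes two denoising steps. A back-of-the-envelope sketch of what that implies for the number of rounds follows. The function name `count_rounds` is hypothetical, and the accounting assumes that each warm-up step costs one sequential round and that every later parallel batch completes exactly `stride` steps (fewer for a final partial batch); the paper's actual schedule may differ.

```python
import math


def count_rounds(num_steps: int, warmup: int, stride: int) -> int:
    """Rounds needed under the assumed accounting: one round per warm-up step,
    then one parallel batch per `stride` remaining denoising steps."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if not 0 <= warmup <= num_steps:
        raise ValueError("warmup must lie between 0 and num_steps")
    return warmup + math.ceil((num_steps - warmup) / stride)


# Illustrative numbers only, not results from the paper.
for s in (1, 2):
    print(f"stride={s}: {count_rounds(50, 2, s)} rounds for 50 steps")
```

This counts rounds only. It says nothing about wall-clock latency, which also depends on how evenly the components are balanced and on inter-device communication.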