DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, Song Han

TL;DR

DistriFusion tackles the latency of high-resolution diffusion model generation by distributing work across multiple GPUs through displaced patch parallelism, reusing activations from prior steps to enable patch interactions with asynchronous communication. The approach preserves image quality while delivering substantial speedups (up to 6.1× on 8 A100 GPUs) on SDXL and scales with batch usage. Key contributions include sparse per-patch computation, AllGather-based context sharing, corrected asynchronous GroupNorm, and warm-up steps to maintain fidelity in few-step sampling. This work offers a practical, training-free path to real-time high-resolution diffusion inference on multi-GPU systems and points to hardware-aware optimizations as a future direction.

Abstract

Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive latency for interactive applications. In this paper, we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However, naively implementing such an algorithm breaks the interaction between patches and loses fidelity, while incorporating such an interaction will incur tremendous communication overhead. To overcome this dilemma, we observe the high similarity between the inputs of adjacent diffusion steps and propose displaced patch parallelism, which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Therefore, our method supports asynchronous communication, which can be pipelined with computation. Extensive experiments show that our method can be applied to the recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one. Our code is publicly available at https://github.com/mit-han-lab/distrifuser.
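
The fidelity loss from naive patch splitting can be illustrated with a toy example. The sketch below is not taken from the paper or from the distrifuser code: it assumes a 1-D "image" of 8 pixels and uses a hypothetical 3-tap blur (`blur3`) as a stand-in for any layer that mixes neighboring positions. Processing two patches independently differs from the full computation only at the pixels next to the patch boundary, which is where the seam appears.

```python
def blur3(row):
    """3-tap box blur with zero padding (hypothetical stand-in for a
    convolution or attention layer that needs neighboring pixels)."""
    padded = [0.0] + list(row) + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(row))]


def naive_patch_blur(row, num_patches):
    """Split the row into equal patches, blur each one in isolation
    (no interaction across patches), and concatenate the results."""
    size = len(row) // num_patches
    out = []
    for p in range(num_patches):
        out.extend(blur3(row[p * size:(p + 1) * size]))
    return out


row = [float(v) for v in range(1, 9)]   # toy 1-D "image" with 8 pixels
full = blur3(row)                       # single-device reference
naive = naive_patch_blur(row, 2)        # two isolated patches

# Indices where the patch-wise result deviates from the reference.
seam_errors = [i for i in range(len(row)) if abs(full[i] - naive[i]) > 1e-12]
print(seam_errors)  # [3, 4]: only the two pixels adjacent to the boundary
```

In this toy setting the interior of each patch is unaffected; the error is confined to positions whose receptive field crosses the patch boundary, which is the context that displaced patch parallelism supplies from the previous step.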

Paper Structure (9 sections, 2 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: We introduce DistriFusion, a training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality. Naïve Patch (Figure 2(b)) suffers from the fragmentation issue due to the lack of patch interaction. Our DistriFusion removes artifacts and avoids the communication overhead by reusing the features from the previous steps. Setting: SDXL with 50-step Euler sampler, $1280\times1920$ resolution. Latency is measured on A100s.
  • Figure 2: (a) Original diffusion model running on a single device. (b) Naïvely splitting the image into 2 patches across 2 GPUs produces an evident seam at the boundary due to the absence of interaction across patches. (c) DistriFusion employs synchronous communication for patch interaction at the first step. After that, we reuse the activations from the previous step via asynchronous communication. In this way, the communication overhead can be hidden within the computation pipeline.
  • Figure 3: Overview of DistriFusion. For simplicity, we omit the inputs of $t$ and $c$, and use $N=2$ devices as an example. Superscripts $^{(1)}$ and $^{(2)}$ represent the first and the second patch, respectively. Stale activations from the previous step are darkened. At each step $t$, we first split the input $\mathbf x_t$ into $N$ patches $\mathbf x_t^{(1)},\ldots,\mathbf x_t^{(N)}$. For each layer $l$ and device $i$, upon getting the input activation patch $\mathbf A_{t}^{l,(i)}$, two operations then proceed asynchronously: First, on device $i$, $\mathbf A_{t}^{l, (i)}$ is scattered back into the stale activation $\mathbf A_{t+1}^l$ from the previous step. The output of this Scatter operation is then fed into the sparse operator $F_l$ (linear, convolution, or attention layers), which performs computations exclusively on the fresh regions and produces the corresponding output. Meanwhile, an AllGather operation is performed over $\mathbf{A}_{t}^{l, (i)}$ to prepare the full activation $\mathbf{A}_{t}^l$ for the next step. We repeat this procedure for each layer. The final outputs are then aggregated together to approximate $\epsilon_\theta(\mathbf x_t)$, which is used to compute $\mathbf x_{t-1}$. The timeline visualization of each device for predicting $\epsilon_\theta(\mathbf x_t)$ is shown in Figure 4.
  • Figure 4: Timeline visualization on each device when predicting $\epsilon_\theta(\mathbf x_t)$. Comm. means communication, which is asynchronous with computation. The AllGather overhead is fully hidden within the computation.
  • Figure 5: Qualitative results. FID is computed against the ground-truth images. Our DistriFusion can reduce the latency according to the number of used devices while preserving visual fidelity.
  • ...and 4 more figures
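
The Scatter / sparse operator / AllGather procedure described in the Figure 3 caption can be sketched as a single-process simulation. This is a minimal illustration under stated assumptions, not the distrifuser implementation: the devices are simulated by a loop, the activation is a 1-D list, the hypothetical `blur3` plays the role of $F_l$, the input drifts by 0.1 per step to mimic the similarity between adjacent diffusion steps, and the loop counter `s` is only an iteration index rather than the diffusion timestep. All function and variable names are made up for illustration.

```python
def blur3(row):
    """3-tap box blur with zero padding (hypothetical stand-in for F_l)."""
    padded = [0.0] + list(row) + [0.0]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(row))]


def displaced_layer(fresh_patches, stale_full):
    """Simulate one layer on N devices.

    fresh_patches[i]: fresh input patch held by device i at this step.
    stale_full: full input activation gathered at the previous step.
    Returns the per-device outputs and the full activation for the next step.
    """
    size = len(fresh_patches[0])
    outputs = []
    for i, patch in enumerate(fresh_patches):
        lo, hi = i * size, (i + 1) * size
        context = list(stale_full)
        context[lo:hi] = patch                  # Scatter fresh patch into stale activation
        # For brevity the whole row is blurred and then sliced; a real sparse
        # operator would compute only the fresh region [lo, hi).
        outputs.append(blur3(context)[lo:hi])
    # AllGather: in a real system this would overlap with the computation above.
    gathered = [v for patch in fresh_patches for v in patch]
    return outputs, gathered


def run(num_steps=3, num_devices=2):
    base = [float(v) for v in range(1, 9)]
    size = len(base) // num_devices
    stale, errors = None, []
    for s in range(num_steps):
        x = [v + 0.1 * s for v in base]         # slowly drifting input
        patches = [x[i * size:(i + 1) * size] for i in range(num_devices)]
        if stale is None:
            stale = x                           # first step: synchronous, context is fresh
        outputs, stale = displaced_layer(patches, stale)
        approx = [v for out in outputs for v in out]
        exact = blur3(x)                        # single-device reference
        errors.append(max(abs(a - e) for a, e in zip(approx, exact)))
    return errors


errors = run()
print(errors)  # first entry is exactly 0.0; later entries stay small
```

In this toy setup the first (synchronous) step reproduces the single-device result exactly, and later steps deviate only by the drift of the stale neighbor divided by the kernel width, which is far smaller than the error of dropping the neighbor entirely.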