DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease

Runsheng Bai, Chengyu Zhang, Yangdong Deng

Abstract

Diffusion models have achieved remarkable success in generating high-fidelity content but suffer from slow, iterative sampling, resulting in high latency that limits their use in interactive applications. We introduce DRiffusion, a parallel sampling framework that parallelizes diffusion inference through a draft-and-refine process. DRiffusion employs skip transitions to generate multiple draft states for future timesteps and computes their corresponding noises in parallel, which are then used in the standard denoising process to produce refined results. Theoretically, our method achieves an acceleration rate of $\tfrac{1}{n}$ or $\tfrac{2}{n+1}$, depending on whether the conservative or aggressive mode is used, where $n$ denotes the number of devices. Empirically, DRiffusion attains 1.4$\times$-3.7$\times$ speedup across multiple diffusion models while incurring minimal degradation in generation quality: on the MS-COCO dataset, both FID and CLIP remain largely on par with those of the original model, while PickScore and HPSv2.1 show only minor average drops of 0.17 and 0.43, respectively. These results verify that DRiffusion delivers substantial acceleration and preserves perceptual quality.
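The draft-and-refine loop described above can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the paper's implementation: `model` is a toy scalar stand-in for a real diffusion noise predictor, `skip_transition` is an assumed helper for jumping between arbitrary timesteps, and the "parallel" noise evaluations are written sequentially for clarity (in practice each would run on its own device). In this toy, the skip transition uses the same update as the refine step, so the refined trajectory matches a purely sequential one exactly; in the real method, drafts are approximations that the refine pass corrects.

```python
def model(x, t):
    # Toy noise predictor; a real diffusion model would be a neural network.
    return 0.1 * x + 0.01 * t

def skip_transition(x, t_from, t_to):
    # Draft a future state by jumping directly from t_from to t_to
    # using the current (cheap) noise estimate.
    return x - (t_from - t_to) * model(x, t_from)

def draft_and_refine(x, timesteps, n_devices):
    """Process the denoising schedule in windows of n_devices steps."""
    i = 0
    while i < len(timesteps) - 1:
        window = timesteps[i:i + n_devices + 1]
        # 1) Draft: chain skip transitions to guess future states.
        drafts = [x]
        for t_from, t_to in zip(window[:-1], window[1:]):
            drafts.append(skip_transition(drafts[-1], t_from, t_to))
        # 2) Parallel: evaluate the noise at every draft state; these
        #    calls are independent, so each could run on its own device.
        noises = [model(d, t) for d, t in zip(drafts[:-1], window[:-1])]
        # 3) Refine: run the standard denoising update with the
        #    precomputed noises, which is now embarrassingly cheap.
        for (t_from, t_to), eps in zip(zip(window[:-1], window[1:]), noises):
            x = x - (t_from - t_to) * eps
        i += n_devices
    return x
```

The key structural point is step 2: by trading a small amount of draft work for independence, the expensive model evaluations within a window no longer depend on one another and can be distributed across devices.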

Paper Structure

This paper contains 28 sections, 14 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison across different prompts. Top row: original outputs; bottom row: results of 3-device aggressive parallelization.
  • Figure 2: Temporal dependencies of different methods. In DDPM, timesteps are strictly sequential. DDIM and Euler solvers allow global re-selection of timesteps, but the traversal pattern within the chosen subsequence or discretization remains sequential. Skip transitions allow direct moves between arbitrary pairs of timesteps, offering local temporal flexibility and enabling parallelization.
  • Figure 3: Computation map of the two versions. The conservative version takes a standalone timestep to compute the noise used for predicting extra steps, while the aggressive version directly reuses the noise computed from the draft states.
  • Figure 4: Qualitative comparison at 50 steps on MS-COCO using Stable Diffusion 2.1. The first row displays conservative results and the second row shows aggressive ones. Our parallelization method accelerates the sampling process with minimal impact on sample quality. The prompt used for generation is: A photo of a cat sitting on a wooden table in sunlight.
  • Figure 5: Latency scaling of DRiffusion with respect to the number of devices (N) on Stable Diffusion 2.1 base using 50 sampling steps. Both modes closely follow their theoretical lower bounds.
  • ...and 1 more figure