Parallel Sampling of Diffusion Models

Andy Shih; Suneel Belkhale; Stefano Ermon; Dorsa Sadigh; Nima Anari

TL;DR

ParaDiGMS accelerates diffusion model sampling along a new axis: it solves denoising steps in parallel via Picard iterations, trading compute for speed while preserving sample quality. The method yields 2-4x speedups across robotics policies and image-generation models and remains compatible with existing fast samplers such as DDIM and DPMSolver. It combines a sliding window, upfront noise sampling, and a tolerance-based stopping criterion so that its samples stay within a controlled total-variation distance of sequential sampling. The approach scales with hardware, particularly on multi-GPU setups, and makes interactive, real-time diffusion applications in robotics and vision more practical, with no measurable loss in FID score, CLIP score, or task reward.

Abstract

Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)? In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 14.6s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.
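
The Picard idea in the abstract (guess all future steps, then refine them together) can be illustrated on a toy problem. The sketch below is not the authors' implementation: the scalar `drift` function, the Euler discretization, and the names `sequential` and `picard` are assumptions chosen for illustration, and a real sampler would call the denoising network in place of `drift`.

```python
def drift(x, t):
    # Hypothetical toy drift; stands in for the denoising network.
    return -2.0 * x + t


def sequential(x0, T):
    # Baseline: T dependent steps, each waiting for the previous one.
    xs = [x0]
    for i in range(T):
        xs.append(xs[i] + drift(xs[i], i / T) / T)
    return xs


def picard(x0, T, tol=1e-12):
    # Initial guess: every timestep equals the starting point.
    xs = [x0] * (T + 1)
    for k in range(1, T + 1):
        # All T drift evaluations use only the previous iterate,
        # so they could be issued as one batched model call.
        drifts = [drift(xs[i], i / T) for i in range(T)]
        new, total = [x0], 0.0
        for d in drifts:
            total += d / T
            new.append(x0 + total)  # x0 plus cumulative drift
        change = max(abs(a - b) for a, b in zip(new, xs))
        xs = new
        if change <= tol:
            return xs, k
    return xs, T


xs, iterations = picard(1.0, 50)
print(iterations, abs(xs[-1] - sequential(1.0, 50)[-1]))
```

After iteration k the first k steps already match the sequential trajectory, so the loop needs at most T iterations; on smooth problems like this toy it typically stops much earlier, which is where the latency saving comes from.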

Paper Structure (21 sections, 2 theorems, 12 equations, 5 figures, 6 tables)

Introduction
Background
Reducing the number of denoising steps
Parallel computation of denoising steps
Practical considerations
Experiments
Diffusion policy
Diffusion image generation
Latent-space diffusion models
Pixel-space diffusion models
Related work
Conclusion
Limitations
Discussion
Acknowledgments
...and 6 more sections

Key Result

proposition 1

(Proof in the appendix.)

Figures (5)

Figure 1: Computation graph of sequential sampling by evaluating $p_\theta(\mathbf{x}_{t+1} \mid \mathbf{x}_t)$, from the perspective of reverse time.
Figure 2: Computation graph of Picard iterations, which introduces skip dependencies.
Figure 3: ParaDiGMS algorithm: accelerating an ODE solver by computing the drift at multiple timesteps in parallel. During iteration $k$, we process in parallel a batch window of size $p$ spanning timesteps $[t, t+p)$. The new value at a point $\mathbf{x}^{k+1}_{t+j}$ is computed from the value $\mathbf{x}^k_t$ at the left end of the window plus the cumulative drift $\frac{1}{T} \sum_{i=t}^{t+j-1} s(\mathbf{x}^k_i, i/T)$ of points in the window. We then slide the window forward until the error is greater than our tolerance, and repeat for the next iteration.
Figure 4: StableDiffusion-v2 generating text-conditioned 768x768 images by running ParaDDPM over a 4x96x96 latent space for 1000 steps, on A100 GPUs. In the efficiency panel, the algorithm inefficiency (gray) denotes the relative number of model evaluations required as the parallel batch window size increases. The colored lines denote the hardware efficiency provided by multiple GPUs. As the batch window size increases, the hardware efficiency overtakes the algorithm inefficiency. In the speedup panel, we normalize the algorithm inefficiency to $1$ to show the net wall-clock speedup of parallel sampling.
Figure 5: Unconditional generation of 256x256 images on diffusion models pretrained on the LSUN Church and Bedroom datasets, running ParaDDPM for 1000 steps on A100 GPUs. We plot the net speedup after dividing the hardware efficiency by the algorithm inefficiency as the batch window size increases. Note that the LSUN results table shows better speedups because it uses a better parallel implementation with multiprocessing instead of DataParallel.
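
The sliding-window scheme in the Figure 3 caption, and the efficiency trade-off in the Figure 4 and Figure 5 captions, can be sketched on a scalar toy problem. This is a minimal illustration under stated assumptions, not the paper's code: `drift` is a hypothetical stand-in for the model, the error test uses an absolute difference rather than whatever norm the paper uses, and the upfront noise of the stochastic samplers is omitted. The function `sliding_window_sample` and its counters `rounds` and `evals` are names invented here.

```python
def drift(x, t):
    # Hypothetical toy drift; stands in for the denoising network.
    return -2.0 * x + t


def sequential(x0, T):
    xs = [x0]
    for i in range(T):
        xs.append(xs[i] + drift(xs[i], i / T) / T)
    return xs


def sliding_window_sample(x0, T, window, tol):
    xs = [x0] * (T + 1)  # current guess for every timestep
    t = 0                # left end of the window (already accepted)
    rounds = 0           # parallel rounds, a proxy for latency
    evals = 0            # total drift evaluations, a proxy for compute
    while t < T:
        p = min(window, T - t)
        # One batched call in a real implementation.
        drifts = [drift(xs[t + j], (t + j) / T) for j in range(p)]
        rounds += 1
        evals += p
        new, total = [], 0.0
        for d in drifts:
            total += d / T
            new.append(xs[t] + total)  # value for timestep t + j + 1
        # Slide up to the first timestep whose value moved more than tol.
        stride = p
        for j in range(1, p):
            if abs(new[j - 1] - xs[t + j]) > tol:
                stride = j
                break
        xs[t + 1 : t + p + 1] = new
        # Timesteps entering the window start from the latest value.
        for i in range(t + p + 1, min(t + p + stride, T) + 1):
            xs[i] = new[-1]
        t += stride
    return xs[T], rounds, evals


T = 100
x, rounds, evals = sliding_window_sample(1.0, T, window=20, tol=1e-4)
print("error vs sequential:", abs(x - sequential(1.0, T)[-1]))
print("algorithm inefficiency:", evals / T)
print("speedup if a batch costs one evaluation:", T / rounds)
```

In this sketch `evals / T` plays the role of the algorithm inefficiency and `T / rounds` is the speedup under the idealized assumption that a batch of `window` evaluations takes as long as one. On real hardware the batch is not free, which is why the captions divide the hardware efficiency by the algorithm inefficiency to obtain the net speedup.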

Theorems & Definitions (4)

proposition 1
proposition 2
proof: Proof of Proposition 1
proof: Proof of Proposition 2
