Self-Refining Diffusion Samplers: Enabling Parallelization via Parareal Iterations
Nikil Roashan Selvam, Amil Merchant, Stefano Ermon
TL;DR
Self-Refining Diffusion Samplers (SRDS) are a Parareal-inspired, parallel-in-time framework that iteratively refines an estimate of the diffusion trajectory to yield high-quality samples with reduced latency. By coupling a fast 1-step coarse solver with parallelizable $\sqrt{N}$-step fine solves and predictor-corrector updates, SRDS is guaranteed to converge to the standard $N$-step solution while enabling batched inference and pipeline parallelism. Empirical results across pixel- and latent-diffusion models show substantial wall-clock speedups (up to 1.7x on a 25-step StableDiffusion-v2 benchmark and up to 4.3x on longer trajectories) with preserved sample quality, at the cost of additional parallel compute and $\mathcal{O}(\sqrt{N})$ memory. This modular approach offers a practical path to real-time diffusion-based applications and can integrate with a range of solvers and future multigrid strategies.
Abstract
In diffusion models, samples are generated through an iterative refinement process that requires hundreds of sequential model evaluations. Several recent methods have introduced approximations (fewer discretization steps or distillation) that gain speed at the cost of sample quality. In contrast, we introduce Self-Refining Diffusion Samplers (SRDS), which retain sample quality and can improve latency at the cost of additional parallel compute. We take inspiration from the Parareal algorithm, a popular numerical method for parallel-in-time integration of differential equations. In SRDS, a quick but rough estimate of a sample is first created and then iteratively refined in parallel through Parareal iterations. SRDS is not only guaranteed to accurately solve the ODE and converge to the serial solution but also benefits from parallelization across the diffusion trajectory, enabling batched inference and pipelining. As we demonstrate for pre-trained diffusion models, the early convergence of this refinement procedure drastically reduces the number of sequential steps required to produce a sample, speeding up generation by, for instance, up to 1.7x on a 25-step StableDiffusion-v2 benchmark and up to 4.3x on longer trajectories.
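The refinement loop described above can be illustrated with a minimal Parareal sketch on a toy scalar ODE. This is not the authors' implementation. The names `drift`, `euler`, and `parareal` are hypothetical, the toy drift stands in for a pre-trained denoiser, and plain Euler steps stand in for whichever diffusion solver is used. The sketch assumes $N = 16$ total steps split into $\sqrt{N} = 4$ blocks, with a coarse solver that takes one step per block and a fine solver that takes $\sqrt{N}$ steps per block.

```python
import numpy as np

def drift(x, t):
    # Hypothetical toy drift; in a diffusion sampler this would call the model.
    return -x + np.sin(t)

def euler(x, t0, t1, n_steps):
    # Integrate from t0 to t1 with n_steps explicit Euler steps.
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = x + h * drift(x, t)
        t = t + h
    return x

def parareal(x0, t_grid, fine_steps, max_iters, tol=None):
    n_blocks = len(t_grid) - 1
    # Rough initial trajectory: one coarse step per block, run serially.
    x = [x0]
    for i in range(n_blocks):
        x.append(euler(x[i], t_grid[i], t_grid[i + 1], 1))
    prev_coarse = x[1:]  # coarse results from the previous iteration
    for k in range(1, max_iters + 1):
        # Fine solves start from the previous iterate and are independent
        # of each other, so they could run in parallel (e.g. as one batch).
        fine = [euler(x[i], t_grid[i], t_grid[i + 1], fine_steps)
                for i in range(n_blocks)]
        # Serial predictor-corrector sweep using only cheap coarse steps.
        new_x = [x0]
        for i in range(n_blocks):
            cur_coarse = euler(new_x[i], t_grid[i], t_grid[i + 1], 1)
            new_x.append(cur_coarse + fine[i] - prev_coarse[i])
            prev_coarse[i] = cur_coarse
        change = abs(new_x[-1] - x[-1])
        x = new_x
        # Optional early stop once the final sample stops changing.
        if tol is not None and change <= tol:
            break
    return x[-1], k

t_grid = np.linspace(0.0, 2.0, 5)        # 4 blocks
sample, iters = parareal(1.0, t_grid, fine_steps=4, max_iters=4)
serial = euler(1.0, 0.0, 2.0, 16)        # standard 16-step serial solve
```

After as many iterations as there are blocks, `sample` matches `serial` up to floating-point rounding, which mirrors the convergence guarantee. Any latency benefit comes from passing a `tol` so that the loop stops after fewer iterations, and from running the fine solves concurrently rather than in a Python loop as done here.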
