Distilling ODE Solvers of Diffusion Models into Smaller Steps

Sanghwan Kim; Hao Tang; Fisher Yu

TL;DR

The paper addresses slow sampling in diffusion models by proposing Distilled-ODE solvers (D-ODE solvers), a lightweight distillation method that adds a single parameter to existing ODE solvers so that they better approximate the denoising outputs along the sampling trajectory. The parameter is fit by distilling teacher solvers that take many steps into student solvers that take few, which yields higher-quality samples at low NFE with negligible overhead and applies to both noise- and data-prediction networks. Across multiple datasets and samplers (e.g., DDIM, iPNDM, DPM-Solver, DEIS, EDM), D-ODE solvers consistently improve FID at small NFE and stay close to the target ODE trajectory, giving practical speedups without extensive retraining. One limitation is that a single scalar parameter may not be expressive enough for very high-resolution generation, which suggests future work on multi-parameter or localized adaptations.

Abstract

Diffusion models have recently gained prominence as a novel category of generative models. Despite their success, these models face a notable drawback in terms of slow sampling speeds, requiring a high number of function evaluations (NFE) in the order of hundreds or thousands. In response, both learning-free and learning-based sampling strategies have been explored to expedite the sampling process. Learning-free sampling employs various ordinary differential equation (ODE) solvers based on the formulation of diffusion ODEs. However, it encounters challenges in faithfully tracking the true sampling trajectory, particularly for small NFE. Conversely, learning-based sampling methods, such as knowledge distillation, demand extensive additional training, limiting their practical applicability. To overcome these limitations, we introduce Distilled-ODE solvers (D-ODE solvers), a straightforward distillation approach grounded in ODE solver formulations. Our method seamlessly integrates the strengths of both learning-free and learning-based sampling. D-ODE solvers are constructed by introducing a single parameter adjustment to existing ODE solvers. Furthermore, we optimize D-ODE solvers with smaller steps using knowledge distillation from ODE solvers with larger steps across a batch of samples. Comprehensive experiments demonstrate the superior performance of D-ODE solvers compared to existing ODE solvers, including DDIM, PNDM, DPM-Solver, DEIS, and EDM, particularly in scenarios with fewer NFE. Notably, our method incurs negligible computational overhead compared to previous distillation techniques, facilitating straightforward and rapid integration with existing samplers. Qualitative analysis reveals that D-ODE solvers not only enhance image quality but also faithfully follow the target ODE trajectory.
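To make the single-parameter idea concrete, the following is a minimal sketch of how one scalar could be fit over a batch by least squares. It is not the authors' implementation. It rests on several assumptions that this page does not state: that the adjusted denoising output takes the form `d_t + lam * (d_t - d_prev)`, that the student update is linear in that output with hypothetical coefficients `a` and `b`, and that the teacher target is available as an array. Here the target is synthetic, standing in for the result of the teacher's many-step sampling.

```python
import numpy as np


def d_ode_output(d_t, d_prev, lam):
    # ASSUMED form: shift the current denoising output along its
    # difference from the previous denoising output, scaled by lam.
    return d_t + lam * (d_t - d_prev)


def student_step(x_t, d_t, d_prev, lam, a, b):
    # Hypothetical one-step solver update, linear in the denoising output.
    # a and b stand in for the solver's schedule-dependent coefficients.
    return a * x_t + b * d_ode_output(d_t, d_prev, lam)


def fit_lambda(x_teacher, x_t, d_t, d_prev, a, b):
    # The student output equals base + lam * direction, so the lam that
    # minimises the squared error to the teacher has a closed form.
    base = a * x_t + b * d_t
    direction = b * (d_t - d_prev)
    num = np.sum((x_teacher - base) * direction)
    den = np.sum(direction * direction)
    return float(num / den) if den > 0 else 0.0


# Synthetic batch: random arrays stand in for samples and for the outputs
# of a frozen denoising network (no real model is evaluated here).
rng = np.random.default_rng(0)
x_t = rng.normal(size=(16, 3, 8, 8))
d_t = rng.normal(size=x_t.shape)
d_prev = rng.normal(size=x_t.shape)
a, b = 0.9, -0.2

# Stand-in teacher target, built with a known lam so the fit can be checked.
true_lam = 0.3
x_teacher = student_step(x_t, d_t, d_prev, true_lam, a, b)

lam = fit_lambda(x_teacher, x_t, d_t, d_prev, a, b)
print(f"recovered lambda: {lam:.6f}")
```

Because only one scalar is estimated per step and the network itself is never updated, a fit of this kind costs a few array reductions per timestep, which is consistent with the negligible overhead the abstract reports.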

Paper Structure (33 sections, 32 equations, 15 figures, 4 tables, 1 algorithm)

Introduction
Background
The Proposed Method
Correlation between Denoising Outputs
Formulation of D-ODE Solver
Knowledge Distillation of D-ODE Solver
Experiments
Noise Prediction Model
Data Prediction Model
Comparison with Previous Distillation Methods
Analysis
Visualization of Sampling Trajectory
Qualitative Analysis
Conclusion
Trilemma of Generative Models
...and 18 more sections

Figures (15)

Figure 1: Overview of the D-ODE solver. Given an input image at timestep $Ct$, teacher sampling performs $C$ denoising steps to obtain the output at timestep $C(t-1)$, while student sampling conducts one denoising step from an input at timestep $t$ to an output at timestep $t-1$. The $C$ steps of teacher sampling are then distilled into a single step of student sampling by optimizing $\lambda_t$ within the D-ODE solver. Note that the denoising network remains frozen for both teacher and student sampling.
Figure 2: Correlation between denoising outputs. Heatmaps are drawn by cosine similarity among denoising outputs with 1000-step DDIM on CIFAR-10. Noise prediction model (left) and data prediction model (right).
Figure 3: Results on the noise prediction models. Image quality measured by FID $\downarrow$ with NFE $\in \{2, 5, 10, 25, 50, 100, 250\}$. For DPM-Solver3 and DEIS3, we use 3 NFE instead of 2 NFE, as a third-order method requires at least three denoising outputs. Dotted lines denote ODE solvers, while solid lines denote the same solvers with the D-ODE solver applied.
Figure 4: Results on the data prediction models. Image quality measured by FID $\downarrow$ with various NFE values (DDIM: {2, 5, 10, 25, 50, 100, 250} and EDM: {3, 5, 9, 25, 49, 99, 249}). Dotted lines denote ODE solvers, while solid lines denote the same solvers with the D-ODE solver applied.
Figure 5: Analysis of local and global characteristics. The top row illustrates the change of norm for ODE and D-ODE solvers. The bottom row presents the update path of two randomly selected pixels in the images. The result of 1000-step DDIM is drawn as the target trajectory, and 10-step sampling is used for both the ODE solvers and the D-ODE solvers. The figures are generated from 1000 samples using a noise prediction model trained on CIFAR-10.
...and 10 more figures
