BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Lingjie Liu, Josh Susskind

TL;DR

BOOT addresses the slow inference of diffusion models by introducing a data-free distillation framework that bootstraps from Gaussian noise to train a time-conditioned single-step student. Central to the method is the Signal-ODE, which operates in the low-frequency signal space, and bootstrapping objectives that avoid requiring real data or EMA maintenance. The approach yields competitive quality with substantial speed-ups on unconditional, class-conditioned, and text-to-image diffusion models, and supports controllable generation via guidance and latent-space manipulation. This data-free, scalable distillation broadens the practical deployment of diffusion models, including large-scale text-to-image systems, under restricted data-access scenarios.

Abstract

Diffusion models have demonstrated excellent potential for generating diverse images. However, their performance often suffers from slow generation due to iterative denoising. Knowledge distillation has recently been proposed as a remedy that can reduce the number of inference steps to one or a few without significant quality degradation. However, existing distillation methods either require significant amounts of offline computation for generating synthetic training data from the teacher model or need to perform expensive online learning with the help of real data. In this work, we present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm. The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion model teacher given any time step. Such a model can be efficiently trained by bootstrapping from two consecutive sampled steps. Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models, which are challenging for conventional methods because their training sets are often large and difficult to access. We demonstrate the effectiveness of our approach on several benchmark datasets in the DDIM setting, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher. The text-to-image results show that the proposed approach is able to handle highly complex distributions, shedding light on more efficient generative modeling.
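The bootstrapping idea in the abstract can be illustrated with a small NumPy sketch. It is not the paper's exact objective: the cosine schedule, the linear `student`, the Gaussian stand-in `teacher_x0`, and all function names are assumptions made for illustration. The target for the student at time s is built from the student's own output at the adjacent time t plus one teacher evaluation, using a deterministic DDIM step rewritten in signal space, so only Gaussian noise is sampled and no real data is needed.

```python
import numpy as np

def alpha_sigma(t):
    """Assumed variance-preserving schedule with alpha_t^2 + sigma_t^2 = 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def teacher_x0(x_t, t):
    """Hypothetical stand-in teacher: the exact denoiser E[x_0 | x_t] for data ~ N(0, I)."""
    alpha_t, _ = alpha_sigma(t)
    return alpha_t * x_t

def student(eps, t, w):
    """Hypothetical time-conditioned student: maps noise eps and time t to a signal estimate."""
    return (w[0] + w[1] * t) * eps

def bootstrap_target(eps, t, s, w):
    """Target for the student at time s, built from its own output at time t (s < t)."""
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    y_t = student(eps, t, w)             # would be held constant (stop-gradient) in training
    x_t = alpha_t * y_t + sigma_t * eps  # noisy input rebuilt from noise alone, no real data
    x0_pred = teacher_x0(x_t, t)         # one teacher evaluation
    ratio = (sigma_s / alpha_s) / (sigma_t / alpha_t)
    return (1.0 - ratio) * x0_pred + ratio * y_t  # one DDIM step written in signal space

def bootstrap_loss(eps, t, s, w):
    """Mean squared error between the student at time s and the bootstrapped target."""
    target = bootstrap_target(eps, t, s, w)
    return float(np.mean((student(eps, s, w) - target) ** 2))

rng = np.random.default_rng(0)
eps = rng.standard_normal((4, 8))  # the only random input is Gaussian noise
w = np.array([1.0, -0.5])          # arbitrary student parameters
loss = bootstrap_loss(eps, t=0.8, s=0.7, w=w)
```

In an actual training loop the loss would be minimized with an autodiff framework over randomly drawn noise and adjacent time pairs; the sketch only evaluates the target and the loss once.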

Paper Structure

This paper contains 46 sections, 17 equations, 20 figures, 2 tables, 1 algorithm.

Figures (20)

  • Figure 2: Comparison of the Consistency Model (Song et al., 2023) (red $\uparrow$) and BOOT (black $\downarrow$), highlighting the opposing prediction pathways.
  • Figure 3: Training pipeline of BOOT. $s$ and $t$ are two consecutive timesteps where $s<t$. From a noise map $\boldsymbol{\epsilon}$, the objective of BOOT minimizes the difference between the output of a student model at timestep $s$, and the output of stacking the same student model and a teacher model at an earlier time $t$. The whole process is data-free.
  • Figure 4: Comparison between the generated outputs of DDIM/Signal-ODE and our distilled model, given the same initial noise input and the same prompt: "A raccoon wearing a space suit, wearing a helmet. Oil painting in the style of Rembrandt". By definition, Signal-ODE converges to the same final sample as the original DDIM, while the distilled single-step model does not necessarily follow.
  • Figure 5: Uncurated samples from DDIM with {50, 10, 1} sampling steps and from the proposed BOOT on the (a) FFHQ, (b) LSUN, and (c) ImageNet benchmarks, respectively, given the same set of initial noise inputs.
  • Figure 6: Uncurated samples from DDIM with {50, 10, 1} sampling steps and from the proposed BOOT, both using SD2.1-base, given the same set of initial noise inputs and prompts sampled from DiffusionDB.
  • ...and 15 more figures
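
The statement in the Figure 4 caption, that Signal-ODE reaches the same final sample as DDIM, can be checked with a short derivation. The notation is assumed here, not taken from this page: $\alpha_t, \sigma_t$ are the signal and noise scales, $f_\phi(x_t, t)$ is the teacher's signal prediction, $\boldsymbol{\epsilon}$ is the fixed initial noise, and $\lambda_t = \log(\sigma_t/\alpha_t)$. Substituting $x_t = \alpha_t y_t + \sigma_t \boldsymbol{\epsilon}$ into the deterministic DDIM step gives an equivalent update on the signal component $y_t$ alone:

```latex
\begin{align}
x_s &= \alpha_s f_\phi(x_t, t) + \sigma_s \, \frac{x_t - \alpha_t f_\phi(x_t, t)}{\sigma_t}
    && \text{(deterministic DDIM step, } s < t\text{)} \\
y_t &:= \frac{x_t - \sigma_t \boldsymbol{\epsilon}}{\alpha_t}
    && \text{(signal component for fixed noise } \boldsymbol{\epsilon}\text{)} \\
y_s &= \left(1 - e^{\lambda_s - \lambda_t}\right) f_\phi(x_t, t) + e^{\lambda_s - \lambda_t} \, y_t,
    && \lambda_t = \log\frac{\sigma_t}{\alpha_t} \\
y_0 &= x_0
    && \text{(since } \alpha_0 = 1,\ \sigma_0 = 0\text{)}
\end{align}
```

Because the change of variables is invertible at every step for a fixed $\boldsymbol{\epsilon}$, the two trajectories correspond one-to-one and end at the same sample. The distilled single-step student only approximates this trajectory, which is consistent with the caption's remark that it does not necessarily reproduce the same output.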