Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

Clément Chadebec, Onur Tasar, Eyal Benaroche, Benjamin Aubin

TL;DR

Flash Diffusion introduces a versatile distillation framework that trains a student to reproduce a multi-step teacher's denoising in a single pass, aided by an adaptive timestep sampling distribution, an adversarial latent-space objective, and distribution matching. By applying LoRA and freezing the teacher, it achieves state-of-the-art performance for few-step generation on COCO benchmarks with far fewer trainable parameters and training hours. The approach applies across conditioning types, backbones, and auxiliary tasks (inpainting, super-resolution, face-swapping) and enables training-free integration with adapters. Overall, the method offers a practical path to real-time diffusion-based generation with competitive quality and wide compatibility across architectures.

Abstract

In this paper, we propose an efficient, fast, and versatile distillation method to accelerate the generation of pre-trained diffusion models: Flash Diffusion. The method reaches state-of-the-art performance in terms of FID and CLIP-Score for few-step image generation on the COCO2014 and COCO2017 datasets, while requiring only several GPU hours of training and fewer trainable parameters than existing methods. In addition to its efficiency, the versatility of the method is demonstrated across several tasks such as text-to-image, inpainting, face-swapping, and super-resolution, and with different backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-$\alpha$), as well as adapters. In all cases, the method drastically reduces the number of sampling steps while maintaining very high-quality image generation. The official implementation is available at https://github.com/gojasper/flash-diffusion.

Paper Structure (46 sections, 15 equations, 18 figures, 1 table)

Figures (18)

  • Figure 1: Inputs (left columns) and generated samples (right columns) using the proposed method for different teacher models and tasks (super-resolution, inpainting, face-swapping and adapters). Samples are generated using 4 Neural Function Evaluations (NFEs).
  • Figure 2: Illustration of the evolution of the proposed timestep distribution $\pi$ throughout training. $t=0$ corresponds to no noise injection while $t=1$ corresponds to the maximum noise injection (i.e. the noisy latent sample is equivalent to a sample drawn from a standard Gaussian distribution). For each phase except the warm-up, 4 timesteps are over-sampled out of the $K=32$ selected ones. As training progresses, the probability mass is shifted towards full noise to favor single-step generation.
  • Figure 3: Flash Diffusion training method: the student is trained with a distillation loss between multiple-step teacher and single-step student denoised samples. The student predictions are then re-noised and denoised with the teacher and student before evaluating the GAN and DMD losses.
  • Figure 4: Qualitative evaluation of the sample quality as the number of NFEs increases for the proposed method applied to the SD1.5 model. Best viewed zoomed in.
  • Figure 5: From left to right and top to bottom: a) FID-5k and CLIP score on the COCO2017 validation set with SD1.5 as teacher. b) FID-30k on the MS COCO2014 validation set with SD1.5 as teacher ($^{\dagger}$ results from yin2023one). Ablations on the influence of c) the guidance scale used to generate with the teacher, d) the loss terms, e) the timestep sampling $\pi(t)$, f) the distillation loss, g) the GAN loss, and h) the value of $K$ in Eq. \ref{eq:mass distribution}.
  • ...and 13 more figures
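
The timestep schedule described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation: only $K=32$, the 4 over-sampled timesteps per phase, the warm-up without over-sampling, and the drift of probability mass towards $t=1$ come from the caption. The uniform timestep grid, the positions of the over-sampled timesteps, the phase names, the weight values, and the `floor` parameter are all assumptions made for illustration, and the distribution is simplified to a discrete one.

```python
import random

K = 32  # number of selected timesteps, as stated in the Figure 2 caption

# Assumption: the K timesteps form a uniform grid on (0, 1]; t = 1 is full noise.
TIMESTEPS = [(i + 1) / K for i in range(K)]

# Assumption: the 4 over-sampled timesteps are evenly spaced and end at t = 1.
EMPHASIZED = [K // 4 - 1, K // 2 - 1, 3 * K // 4 - 1, K - 1]

# Hypothetical per-phase weights on the emphasized timesteps, ordered from
# low noise to full noise. None of these numbers come from the paper.
PHASES = {
    "warm-up": None,  # no over-sampling: uniform over the K timesteps
    "phase 1": [1.0, 1.0, 1.0, 1.0],
    "phase 2": [0.5, 1.0, 1.5, 2.0],
    "phase 3": [0.25, 0.5, 1.0, 4.0],
}


def timestep_probs(weights, floor=0.1):
    """Return a length-K probability vector over TIMESTEPS."""
    if weights is None:
        return [1.0 / K] * K
    # Small floor (hypothetical) so every timestep keeps a non-zero probability.
    probs = [floor / K] * K
    for idx, weight in zip(EMPHASIZED, weights):
        probs[idx] += weight
    total = sum(probs)
    return [p / total for p in probs]


def sample_timesteps(phase, batch_size, rng):
    """Draw one timestep per training example for the given phase."""
    probs = timestep_probs(PHASES[phase])
    return rng.choices(TIMESTEPS, weights=probs, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    for phase in PHASES:
        probs = timestep_probs(PHASES[phase])
        mean_t = sum(p * t for p, t in zip(probs, TIMESTEPS))
        print(f"{phase}: P(t=1) = {probs[-1]:.3f}, E[t] = {mean_t:.3f}")
    print(sample_timesteps("phase 3", 8, rng))
```

With these illustrative weights, the probability assigned to $t=1$ grows from one phase to the next, which mirrors the shift towards full noise that the caption describes as favoring single-step generation.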