Plug-and-Play Diffusion Distillation

Yi-Ting Hsiao, Siavash Khodadadeh, Kevin Duarte, Wei-An Lin, Hui Qu, Mingi Kwon, Ratheesh Kalarot

TL;DR

Diffusion models are powerful but slow, because sampling is iterative and classifier-free guidance (CFG) adds further computation at every step. The paper introduces a plug-and-play distillation framework that trains an external lightweight guide model while keeping the base model frozen, reducing inference to roughly $0.5\times$ the FLOPs with trainable parameters amounting to only about $1\%$ of the base model. It offers two guide architectures (full and tiny) and a CFG distillation formulation that injects feature maps into the decoder to mimic guidance in a single forward pass, followed by progressive sampling-step distillation to reduce the number of steps further. The trained guide also generalizes to domain-specific, fine-tuned versions of the base model without retraining: it maintains competitive FID and CLIP scores at 8–16 steps and can be plugged into models with diverse styles (e.g., watercolor, realistic, 3D cartoon). These results point to a practical path toward efficient, adaptable diffusion-based generation in real-world applications.

Abstract

Diffusion models have shown tremendous results in image generation. However, due to the iterative nature of the diffusion process and its reliance on classifier-free guidance, inference times are slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen. We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, and requires trainable parameters amounting to only 1% of the base model. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without the need for additional training: this "plug-and-play" functionality drastically improves inference computation while maintaining the visual fidelity of generated images. Empirically, we show that our approach is able to produce visually appealing results and achieve a comparable FID score to the teacher with as few as 8 to 16 steps.
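To make the cost of classifier-free guidance concrete, the following minimal sketch shows why a guided sampling step normally needs two forward passes of the denoiser. It is an illustration under stated assumptions, not the paper's implementation: the `Denoiser` signature, the `CountingToyModel` stand-in, and the guidance convention $\hat{\epsilon} = \epsilon_{\text{uncond}} + w\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$ (one of several common parameterizations) are all assumed here.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical denoiser signature: (noisy latent, timestep, optional prompt) -> predicted noise.
Denoiser = Callable[[Vector, int, Optional[str]], Vector]


def cfg_noise(model: Denoiser, x: Vector, t: int, prompt: str, w: float) -> Vector:
    """Classifier-free guidance for one sampling step (two forward passes)."""
    eps_uncond = model(x, t, None)    # pass 1: unconditional prediction
    eps_cond = model(x, t, prompt)    # pass 2: text-conditioned prediction
    # Assumed convention: move from the unconditional prediction toward the
    # conditional one, scaled by the guidance weight w.
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


class CountingToyModel:
    """Toy stand-in for a diffusion denoiser that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, t: int, prompt: Optional[str]) -> Vector:
        self.calls += 1
        shift = 0.0 if prompt is None else 1.0  # arbitrary toy conditioning effect
        return [xi * 0.5 + shift for xi in x]


toy = CountingToyModel()
eps = cfg_noise(toy, [2.0, 4.0], t=10, prompt="a watercolor fox", w=7.5)
print(eps, toy.calls)
```

Each guided step here calls the denoiser twice. A student of the kind the abstract describes would instead make one pass through the frozen base model plus a pass through the much smaller guide, which is consistent with the reported reduction of inference computation by almost half.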

Paper Structure (33 sections, 5 equations, 11 figures, 3 tables, 1 algorithm)


Figures (11)

  • Figure 1: We train a guide model that replaces classifier-free guidance and can be plugged into other base models from different domains.
  • Figure 2: Applying our trained guide model to different fine-tuned latent diffusion models (LDM).
  • Figure 3: Overview of CFG distillation. Instead of using two feed-forward passes with classifier-free guidance, we train a student model conditioned on the guidance, cross-attention, and time embedding to predict the output image in a single forward pass.
  • Figure 4: Comparison of the full guide model architecture and the tiny architecture.
  • Figure 5: The results of the full guide model and tiny guide model under 8, 16, and 50 sampling steps.
  • ...and 6 more figures