Plug-and-Play Diffusion Distillation

Yi-Ting Hsiao; Siavash Khodadadeh; Kevin Duarte; Wei-An Lin; Hui Qu; Mingi Kwon; Ratheesh Kalarot

TL;DR

Diffusion models are powerful but slow at inference because of iterative denoising and classifier-free guidance (CFG). The paper introduces a plug-and-play distillation framework that trains an external lightweight guide model while keeping the base model frozen, enabling faster inference with roughly $0.5\times$ the FLOPs and trainable parameters amounting to only $1\%$ of the base model's. The approach provides two guide architectures (full and tiny) and a CFG distillation formulation that injects the guide's feature maps into the base model's decoder to mimic guidance in a single forward pass, followed by progressive sampling-step distillation to further reduce the number of steps. Importantly, the guide model generalizes across domain-specific, fine-tuned diffusion models without retraining: it maintains competitive FID/CLIP scores at 8–16 steps and can be plugged into models of diverse styles (e.g., watercolor, realistic, 3D cartoon). These results demonstrate a practical path toward efficient, adaptable diffusion-based generation for real-world applications.

Abstract

Diffusion models have shown tremendous results in image generation. However, due to the iterative nature of the diffusion process and its reliance on classifier-free guidance, inference is slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen. We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, and requires trainable parameters amounting to only 1\% of the base model's. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without the need for additional training: this "plug-and-play" functionality drastically reduces inference computation while maintaining the visual fidelity of generated images. Empirically, we show that our approach is able to produce visually appealing results and achieve an FID score comparable to the teacher's with as few as 8 to 16 steps.
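The "almost half" figure follows from how classifier-free guidance is evaluated: each sampling step needs one conditional and one unconditional forward pass of the denoiser, whereas a guided student needs a single pass plus a small guide. The toy sketch below illustrates only that call-count difference. Everything in it is an assumption for illustration: `ToyDenoiser` is a linear stand-in rather than a latent diffusion U-Net, the guidance formula uses one common scale convention, and `toy_guide` is handed a closed-form correction instead of being trained as in the paper.

```python
import numpy as np


class ToyDenoiser:
    """Hypothetical linear stand-in for a frozen denoiser; counts forward passes."""

    def __init__(self, dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.w_x = rng.normal(size=(dim, dim))
        self.w_c = rng.normal(size=(dim, dim))
        self.calls = 0

    def __call__(self, x, cond, injected=None):
        self.calls += 1
        eps = x @ self.w_x + cond @ self.w_c
        # "injected" mimics guide features added inside the frozen model.
        return eps if injected is None else eps + injected


def cfg_two_pass(model, x, cond, null_cond, scale):
    """Classifier-free guidance: two forward passes per sampling step."""
    eps_cond = model(x, cond)
    eps_uncond = model(x, null_cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)


def guide_one_pass(model, guide, x, cond, scale):
    """Guided student: one base-model pass plus a lightweight guide."""
    return model(x, cond, injected=guide(x, cond, scale))


rng = np.random.default_rng(1)
x = rng.normal(size=(2, 4))          # noisy latents
cond = rng.normal(size=(2, 4))       # text conditioning
null_cond = np.zeros_like(cond)      # empty-prompt conditioning
scale = 7.5

teacher = ToyDenoiser()
eps_teacher = cfg_two_pass(teacher, x, cond, null_cond, scale)

student_base = ToyDenoiser()         # same seed, so the same frozen weights


def toy_guide(x, cond, scale):
    # Assumption: exact correction for this linear toy, not a learned network.
    return (scale - 1.0) * (cond @ student_base.w_c)


eps_student = guide_one_pass(student_base, toy_guide, x, cond, scale)
print(teacher.calls, student_base.calls, np.allclose(eps_teacher, eps_student))
```

In this toy the student reproduces the guided prediction with one base-model call instead of two. For a real model the match is only approximate and depends on how well the guide is distilled.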

Paper Structure (33 sections, 5 equations, 11 figures, 3 tables, 1 algorithm)

Introduction
Related work
Reducing inference time in diffusion models
Controlling diffusion models
Preliminary
Background on diffusion models
Classifier-free guidance
Methodology
Overview
CFG distillation
Guide model architecture
Full guide model
Tiny guide model
Sampling steps distillation
Experiments
...and 18 more sections

Figures (11)

Figure 1: We trained a guide model that replaces classifier-free guidance and can be plugged into other base models from different domains.
Figure 2: Applying our trained guide model to different fine-tuned latent diffusion models (LDM).
Figure 3: Overview of CFG distillation. Instead of using two feed-forward passes and classifier-free guidance, we train a student model conditioned on the guidance, cross-attention, and time embedding to predict the output image with only one forward pass.
Figure 4: Comparison of the full guide model architecture and the tiny architecture.
Figure 5: The results of the full guide model and tiny guide model under 8, 16, and 50 sampling steps.
...and 6 more figures
