One-Step Diffusion Distillation through Score Implicit Matching
Weijian Luo, Zemin Huang, Zhengyang Geng, J. Zico Kolter, Guo-jun Qi
TL;DR
The paper tackles the bottleneck of slow sampling in diffusion models by introducing Score Implicit Matching (SIM), a data-free framework that distills pre-trained diffusion models into single-step generators. SIM leverages a broad class of score-based divergences between the teacher's scores and a student generator, and uses a score-gradient theorem to obtain tractable gradients without requiring explicit backpropagation through the intractable student scores. Different distance functions, notably the Pseudo-Huber distance, are explored to improve robustness, convergence speed, and stability, with SiD identified as a special case within SIM. Empirically, SIM delivers state-of-the-art or competitive one-step results on CIFAR-10 and, notably, distills a transformer-based text-to-image diffusion model into a one-step generator with an aesthetic score of 6.42, outperforming multiple baselines while remaining data-free and computationally efficient. These results indicate SIM’s practical potential for rapid, high-quality one-step generative models across vision and multimodal tasks, enabling industry-scale deployment and broader exploration of diffusion-transformer distillation.
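To make the role of the distance function concrete, the sketch below compares a squared L2 distance with a Pseudo-Huber distance on a score-difference vector. It assumes the commonly used Pseudo-Huber form sqrt(||y||^2 + c^2) - c. The summary above does not state the exact formula or the value of the constant c, so both are assumptions here, and the function names are hypothetical. The point of the comparison is that the Pseudo-Huber distance behaves like a scaled squared norm for small differences but grows only linearly for large ones, which is one plausible reading of the robustness claim.

```python
import math


def squared_l2(y):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(v * v for v in y)


def pseudo_huber(y, c=1.0):
    """Pseudo-Huber distance sqrt(||y||^2 + c^2) - c.

    Assumption: this is the common textbook form; c is a hypothetical
    smoothing constant, not a value taken from the paper.
    """
    return math.sqrt(squared_l2(y) + c * c) - c


# y stands in for (teacher score - student score) at one sample;
# the numbers are purely illustrative.
small = [0.01, 0.0]
large = [300.0, 400.0]
for name, y in (("small", small), ("large", large)):
    print(name, squared_l2(y), pseudo_huber(y))
```

For the large difference, the squared L2 value is several orders of magnitude larger than the Pseudo-Huber value, so a single outlying score difference would dominate a squared loss far more than a Pseudo-Huber loss.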
Abstract
Despite their strong performance on many generative tasks, diffusion models require a large number of sampling steps to generate realistic samples. This has motivated the community to develop effective methods for distilling pre-trained diffusion models into more efficient models, but these methods still typically require few-step inference or perform substantially worse than the underlying model. In this paper, we present Score Implicit Matching (SIM), a new approach to distilling pre-trained diffusion models into single-step generator models that maintains almost the same sample generation ability as the original model and is data-free, requiring no training samples for distillation. The method rests upon the fact that, although the traditional score-based loss is intractable to minimize for generator models, under certain conditions we can efficiently compute the gradients for a wide class of score-based divergences between a diffusion model and a generator. SIM shows strong empirical performance for one-step generators: on the CIFAR10 dataset, it achieves an FID of 2.06 for unconditional generation and 1.96 for class-conditional generation. Moreover, by applying SIM to a leading transformer-based diffusion model, we distill a single-step generator for text-to-image (T2I) generation that attains an aesthetic score of 6.42 with no performance decline relative to the original multi-step counterpart, clearly outperforming other one-step generators, including SDXL-TURBO (5.33), SDXL-LIGHTNING (5.34), and HYPER-SDXL (5.85). We will release this industry-ready one-step transformer-based T2I generator along with this paper.
