One-Step Diffusion Distillation through Score Implicit Matching

Weijian Luo, Zemin Huang, Zhengyang Geng, J. Zico Kolter, Guo-jun Qi

TL;DR

The paper tackles the bottleneck of slow sampling in diffusion models by introducing Score Implicit Matching (SIM), a data-free framework that distills pre-trained diffusion models into single-step generators. SIM leverages a broad class of score-based divergences between the teacher's scores and a student generator, and uses a score-gradient theorem to obtain tractable gradients without requiring explicit backpropagation through the intractable student scores. Different distance functions, notably the Pseudo-Huber distance, are explored to improve robustness, convergence speed, and stability, with SiD identified as a special case within SIM. Empirically, SIM delivers state-of-the-art or competitive one-step results on CIFAR-10 and, notably, distills a transformer-based text-to-image diffusion model into a one-step generator with an aesthetic score of 6.42, outperforming multiple baselines while remaining data-free and computationally efficient. These results indicate SIM’s practical potential for rapid, high-quality one-step generative models across vision and multimodal tasks, enabling industry-scale deployment and broader exploration of diffusion-transformer distillation.
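For readers unfamiliar with the Pseudo-Huber distance mentioned above, the following is a minimal NumPy sketch of its standard form, $d(\bm{y}) = \sqrt{\|\bm{y}\|_2^2 + c^2} - c$. The function names and the default value of `c` are assumptions made for illustration and are not taken from the paper. The sketch also shows that the gradient of this distance has norm below 1, which is one plausible reading of the robustness claim: large score differences cannot produce arbitrarily large gradients.

```python
import numpy as np


def pseudo_huber(y, c=1.0):
    """Pseudo-Huber distance sqrt(||y||_2^2 + c^2) - c over the last axis.

    `c` is a positive smoothing constant; the default here is hypothetical.
    """
    y = np.asarray(y, dtype=np.float64)
    sq_norm = np.sum(y * y, axis=-1)
    return np.sqrt(sq_norm + c * c) - c


def pseudo_huber_grad(y, c=1.0):
    """Gradient of the distance with respect to y: y / sqrt(||y||_2^2 + c^2).

    Its Euclidean norm is strictly below 1 for any c > 0.
    """
    y = np.asarray(y, dtype=np.float64)
    sq_norm = np.sum(y * y, axis=-1, keepdims=True)
    return y / np.sqrt(sq_norm + c * c)


if __name__ == "__main__":
    diff = np.array([3.0, 4.0])  # stand-in for a difference of two score vectors
    print(pseudo_huber(diff, c=12.0))
    print(np.linalg.norm(pseudo_huber_grad(diff, c=12.0)))
```

For small $\|\bm{y}\|$ the distance behaves like a scaled squared norm, and for large $\|\bm{y}\|$ it grows roughly linearly, which is the usual motivation for Huber-type losses.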

Abstract

Despite their strong performance on many generative tasks, diffusion models require a large number of sampling steps to generate realistic samples. This has motivated the community to develop effective methods to distill pre-trained diffusion models into more efficient models, but these methods still typically require few-step inference or perform substantially worse than the underlying model. In this paper, we present Score Implicit Matching (SIM), a new approach to distilling pre-trained diffusion models into single-step generator models that maintains almost the same sample generation ability as the original model and is data-free, with no need for training samples during distillation. The method rests upon the fact that, although the traditional score-based loss is intractable to minimize for generator models, under certain conditions we can efficiently compute the gradients for a wide class of score-based divergences between a diffusion model and a generator. SIM shows strong empirical performance for one-step generators: on the CIFAR10 dataset, it achieves an FID of 2.06 for unconditional generation and 1.96 for class-conditional generation. Moreover, by applying SIM to a leading transformer-based diffusion model, we distill a single-step generator for text-to-image (T2I) generation that attains an aesthetic score of 6.42 with no performance decline relative to the original multi-step counterpart, clearly outperforming other one-step generators including SDXL-TURBO (5.33), SDXL-LIGHTNING (5.34), and HYPER-SDXL (5.85). We will release this industry-ready one-step transformer-based T2I generator along with this paper.
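As a hedged illustration of what a "score-based divergence" between a diffusion model and a generator can look like, one general form is sketched below. The symbols $w(t)$ (a time weighting), $\pi_t$ (a sampling distribution over noisy inputs), and $d(\cdot)$ (a scalar distance function) are notation assumed here for illustration; the precise definition and conditions are those given in the paper itself.

```latex
\mathcal{D}^{[0,T]}(p, q)
  = \int_{0}^{T} w(t)\,
    \mathbb{E}_{\bm{x}_t \sim \pi_t}
    \Big[ d\big( \bm{s}_{p_t}(\bm{x}_t) - \bm{s}_{q_t}(\bm{x}_t) \big) \Big]
    \,\mathrm{d}t
```

Under these assumptions, choosing $d(\bm{y}) = \|\bm{y}\|_2^2$ and $\pi_t = p_t$ gives a time-integrated Fisher-type divergence, while other choices of $d$ give other members of the family.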

Paper Structure

This paper contains 37 sections, 2 theorems, 32 equations, 8 figures, 5 tables, 2 algorithms.

Key Result

Theorem 3.1

If the distribution $p_{\theta, t}$ satisfies some mild regularity conditions, then for any score function $\bm{s}_{q_t}(\cdot)$ the following equation holds for all parameters $\theta$:

Figures (8)

  • Figure 1: Time for a Human Preference Study! Could you please tell us which one is better? Hint: the rightmost column is the one-step Latent Consistency Model of PixelArt-$\alpha$; the left two columns are randomly placed, with one generated from our one-step SIM-DiT-600M model and the other generated from the 14-step PixelArt-$\alpha$ teacher diffusion model. The answer is given in the appendix.
  • Figure 2: Left two panels: comparison of distillation methods with a batch size of 256 and a learning rate of $10^{-4}$ (left: FID; right: Inception Score). Right two panels: comparison of distillation methods with a batch size of 256 and a learning rate of $10^{-5}$ (left: FID; right: Inception Score). All runs share the same settings and differ only in the distillation method.
  • Figure 3: Qualitative comparison of SIM-DiT-600M against other few-step text-to-image models. Please zoom in to check details, lighting, and aesthetic performance. The prompts are listed in the appendix.
  • Figure 4: Visualization of bad generation cases of one-step SIM-DiT model.
  • Figure 5: Demonstration of our human preference user study interface.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Theorem 3.1: Score-divergence gradient Theorem
  • Theorem A.1: Score-projection identity