Diffusion Models Are Innate One-Step Generators

Bowen Zheng, Tianming Yang

TL;DR

This work identifies a core limitation in existing diffusion-model distillation: teacher–student mismatches in multi-step to one-step transfer hinder performance. It introduces GAN Distillation at Distribution Level (GDD), a distribution-level, data-grounded distillation method that eschews instance-level supervision, achieving SOTA results with only 5M training images. It further reveals that diffusion models exhibit innate one-step capabilities when most convolutional layers are frozen, leading to GDD-I, which often outperforms GDD. Across CIFAR-10, AFHQv2 64x64, FFHQ 64x64, and ImageNet 64x64, the proposed approach delivers strong sample quality with far lower data and compute requirements, highlighting a new direction for efficient diffusion distillation. The findings illuminate the time-step–dependent activation of DM layers and open avenues for combining speed with broad diffusion-model capabilities such as editing and inpainting in future work.

Abstract

Diffusion Models (DMs) have achieved great success in image generation and other fields. By sampling finely along the trajectory defined by an SDE/ODE solver and a well-trained score model, DMs can generate remarkably high-quality results. However, this precise sampling often requires multiple steps and is computationally demanding. To address this problem, instance-based distillation methods have been proposed to distill a one-step generator from a DM by having a simpler student model mimic a more complex teacher model. Yet our research reveals an inherent limitation in these methods: the teacher model, with more steps and more parameters, occupies different local minima than the student model, leading to suboptimal performance when the student attempts to replicate the teacher. To avoid this problem, we introduce a novel distributional distillation method that uses a distributional loss exclusively. This method exceeds state-of-the-art (SOTA) results while requiring significantly fewer training images. Additionally, we show that DMs' layers are differentially activated at different time steps, leading to an inherent capability to generate images in a single step. Freezing most of the convolutional layers in a DM during distributional distillation enables this innate capability and leads to further performance improvements. Our method achieves SOTA results on CIFAR-10 (FID 1.54), AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and ImageNet 64x64 (FID 1.16) with great efficiency. Most of these results are obtained with only 5 million training images within 6 hours on 8 A100 GPUs.
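
For readers unfamiliar with the metric, the FID scores quoted above are, under the standard definition, the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are notation introduced here for illustration and are not taken from the paper: \mu_r, \Sigma_r are the feature mean and covariance of the real images, and \mu_g, \Sigma_g those of the generated images.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
      - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one, and the score is zero only when both the means and the covariances match.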

Paper Structure

This paper contains 24 sections, 6 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Proposed method and comparison with other methods on ImageNet 64x64. A. Comparison on ImageNet 64x64. Our methods achieve SOTA results with less training data. B. Proposed method. We initialize the generator with a pre-trained diffusion model, freeze most of the convolutional layers, and then use a distribution-level loss exclusively.
  • Figure 2: rel-FIDs correlate strongly with the number of steps and with sigma. A. rel-FIDs (blue) and abs-FIDs (grey) for teacher models with different numbers of steps. B. abs-FIDs (grey) and rel-FIDs (blue) for two-step teacher models with different sigmas at the intermediate point. FIDs are computed from 50,000 images generated by each model with fixed seeds from 0 to 49,999; an Inception model provides the features required for calculating FID.
  • Figure 3: The relative value of the last convolutional layer in each UNet block at different depths. Values are normalized with min-max normalization across time steps for each layer. The red-to-blue color gradient indicates shallow to deeper layers. A. Encoder. B. Decoder.
  • Figure 4: Comparison and effect of CD loss. A. Comparison of CD and CD-I. B. Effect of adding an extra CD loss.
  • Figure 5: CIFAR-10, EDM, NFE=18, FID=1.96
  • ...and 14 more figures
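
The per-layer min-max normalization across time steps mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the layer names, and the numbers are all hypothetical, and the handling of a constant layer (mapping it to zeros) is an assumption made here to avoid division by zero.

```python
def min_max_across_timesteps(values_by_layer):
    """Rescale each layer's per-time-step values to the range [0, 1].

    values_by_layer maps a layer name to a list holding one value per
    time step. Each layer is normalized independently, so layers with
    very different scales become comparable on one plot.
    """
    normalized = {}
    for layer, values in values_by_layer.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # Assumption: a layer that is constant over time has no range,
        # so it is mapped to zeros rather than dividing by zero.
        normalized[layer] = [
            0.0 if span == 0 else (v - lo) / span for v in values
        ]
    return normalized


# Hypothetical layer names and values, for illustration only.
example = {
    "enc.block0": [2.0, 4.0, 6.0],
    "dec.block3": [10.0, 10.0, 10.0],
}
result = min_max_across_timesteps(example)
```

Because the normalization is applied per layer, the resulting curves show when each layer is relatively most active across time steps, not how large its values are compared with other layers.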