Distilling Diffusion Models into Conditional GANs

Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park

TL;DR

This work tackles the slow inference of diffusion models by distilling a pretrained multi-step teacher into a one-step conditional GAN, formulated as paired noise-to-image translation. It introduces E-LatentLPIPS, a latent-space perceptual loss with ensembled differentiable augmentations, and a multi-scale conditional diffusion discriminator initialized from the teacher to preserve alignment with text prompts. The resulting Diffusion2GAN outperforms prior one-step distillation methods (DMD, SDXL-Turbo, and SDXL-Lightning) on the zero-shot COCO benchmark in FID and CLIP metrics while running at interactive speeds. Faster text-to-image generation could benefit real-time creative tools and broader applications, though output quality depends on the teacher model and ethical-use considerations remain.

Abstract

We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in the diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models (DMD, SDXL-Turbo, and SDXL-Lightning) on the zero-shot COCO benchmark.
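
As a rough illustration of the ensembled-augmentation idea behind E-LatentLPIPS, the Python sketch below applies the same randomly sampled augmentation to a generated latent and a target latent, then averages a feature-space distance over several draws. It is a dependency-free toy under stated assumptions, not the paper's implementation: `toy_features` is a hypothetical stand-in for a learned latent-space perceptual network, and the flip and shift augmentations, the `n_aug` parameter, and all function names are illustrative choices.

```python
import math
import random


def hflip(z):
    """Mirror every row of a C x H x W latent stored as nested lists."""
    return [[row[::-1] for row in ch] for ch in z]


def roll(z, dy, dx):
    """Circularly shift a latent by (dy, dx) positions."""
    h, w = len(z[0]), len(z[0][0])
    return [[[ch[(y - dy) % h][(x - dx) % w] for x in range(w)]
             for y in range(h)] for ch in z]


def sample_augmentation(rng, h, w):
    """Draw one random flip + shift. The returned function applies the
    SAME transform to every latent it is called on."""
    flip = rng.random() < 0.5
    dy, dx = rng.randrange(h), rng.randrange(w)

    def apply(z):
        return roll(hflip(z) if flip else z, dy, dx)

    return apply


def toy_features(z):
    """Hypothetical stand-in for a perceptual network (assumption):
    2x2 average pooling, then unit-normalise across channels."""
    c, h, w = len(z), len(z[0]), len(z[0][0])
    feats = []
    for y in range(0, h - 1, 2):
        for x in range(0, w - 1, 2):
            v = [(z[k][y][x] + z[k][y][x + 1]
                  + z[k][y + 1][x] + z[k][y + 1][x + 1]) / 4.0
                 for k in range(c)]
            norm = math.sqrt(sum(t * t for t in v)) + 1e-10
            feats.append([t / norm for t in v])
    return feats


def latent_distance(a, b):
    """Mean squared difference between the toy features of two latents."""
    fa, fb = toy_features(a), toy_features(b)
    sq = [sum((p - q) ** 2 for p, q in zip(va, vb))
          for va, vb in zip(fa, fb)]
    return sum(sq) / len(sq)


def ensembled_distance(a, b, n_aug=8, seed=0):
    """Average the distance over n_aug shared random augmentations."""
    rng = random.Random(seed)
    h, w = len(a[0]), len(a[0][0])
    total = 0.0
    for _ in range(n_aug):
        t = sample_augmentation(rng, h, w)
        total += latent_distance(t(a), t(b))
    return total / n_aug


def random_latent(rng, c=4, h=8, w=8):
    """Toy Gaussian latent; the 4 x 8 x 8 shape is an arbitrary choice."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(w)]
             for _ in range(h)] for _ in range(c)]


if __name__ == "__main__":
    rng = random.Random(1)
    target, generated = random_latent(rng), random_latent(rng)
    print(ensembled_distance(target, target))     # identical latents give 0.0
    print(ensembled_distance(generated, target))  # positive for differing latents
```

Averaging over `n_aug` draws in one call is a simplification chosen for this sketch; in a training loop, sampling a single fresh augmentation per iteration would approximate the same expectation over time.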

Paper Structure (23 sections, 9 equations, 14 figures, 11 tables)

Figures (14)

  • Figure 1: Visual comparison to the SDXL teacher [podell2024sdxl] with a classifier-free guidance scale [ho2022classifier] of 7 and selected distillation student models, including SDXL-Turbo [sauer2023adversarial], SDXL-Lightning [lin2024sdxl], and our SDXL-Diffusion2GAN. All images in a given row were generated using the same noise input, except for SDXL-Turbo, which requires a distinct noise size of $4\times64\times64$. Compared to other distillation models, our SDXL-Diffusion2GAN more closely adheres to the original ODE trajectory.
  • Figure 2: Visual comparison to the Stable Diffusion 1.5 teacher [stablediffusion1.5] with a classifier-free guidance scale [ho2022classifier] of 8 and selected distillation student models, including InstaFlow-0.9B [liu2023insta], LCM-LoRA [luo2023latentlora], and our Diffusion2GAN. The same noise input was used to generate all images in a given row. Our method, Diffusion2GAN, achieves higher realism than the 2-step LCM-LoRA and InstaFlow-0.9B.
  • Figure 3: High-quality generated images using our one-step Diffusion2GAN framework. Our model can synthesize a 512px/1024px image at an interactive speed of 0.09/0.16 seconds on an A100 GPU, while the teacher model, Stable Diffusion 1.5 [stablediffusion1.5]/SDXL [podell2024sdxl], produces an image in 2.59/5.60 seconds using 50 DDIM steps [song2021denoising]. Please visit https://mingukkang.github.io/Diffusion2GAN/ for more visual results.
  • Figure 4: E-LatentLPIPS for latent space distillation. Training a single iteration with LPIPS [zhang2018unreasonable] takes 117ms and 15.0GB of extra memory on an NVIDIA A100, whereas our E-LatentLPIPS requires 12.1ms and 0.6GB on the same device. Consequently, E-LatentLPIPS accelerates the perceptual loss computation by $9.7\times$ compared to LPIPS, while simultaneously reducing memory consumption.
  • Figure 5: Single image reconstruction. To gain insight into the loss landscape of our regression loss, we conduct an image reconstruction experiment by directly optimizing a single latent with different loss functions. Reconstruction with LPIPS roughly reproduces the target image, but at the cost of needing to decode into pixels. LatentLPIPS alone cannot precisely reconstruct the image. However, our ensembled augmentation, E-LatentLPIPS, can more precisely reconstruct the target while operating directly in the latent space.
  • ...and 9 more figures
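
The timing and memory figures quoted in the Figure 3 and Figure 4 captions can be turned into ratios with a few lines of arithmetic. The short Python sketch below does only that; the `speedup` helper and the dictionary labels are illustrative names, and the input numbers are copied from the captions. The generation speedups and the memory ratio are derived here from those numbers and are not stated in the captions themselves.

```python
def speedup(baseline, improved):
    """Return how many times smaller `improved` is than `baseline`."""
    return baseline / improved


# (baseline, improved) pairs copied from the Figure 3 and Figure 4 captions.
measurements = {
    "512px generation, SD 1.5 teacher vs. Diffusion2GAN (s)": (2.59, 0.09),
    "1024px generation, SDXL teacher vs. Diffusion2GAN (s)": (5.60, 0.16),
    "loss time per iteration, LPIPS vs. E-LatentLPIPS (ms)": (117.0, 12.1),
    "extra memory, LPIPS vs. E-LatentLPIPS (GB)": (15.0, 0.6),
}

for label, (baseline, improved) in measurements.items():
    print(f"{label}: {speedup(baseline, improved):.1f}x")
```

The third ratio, 117 / 12.1, rounds to the $9.7\times$ acceleration reported in the Figure 4 caption.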