Distilling Diffusion Models into Conditional GANs
Minguk Kang, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, Taesung Park
TL;DR
This work tackles the slow inference of diffusion models by distilling a pretrained multi-step teacher into a one-step conditional GAN, with distillation formulated as paired noise-to-image translation. It introduces E-LatentLPIPS, a latent-space perceptual loss that ensembles differentiable augmentations, and a multi-scale conditional discriminator that is initialized from the teacher diffusion model to preserve alignment with text prompts. The resulting model, Diffusion2GAN, achieves state-of-the-art one-step performance on COCO and SDXL benchmarks, outperforming prior distillation methods on FID and CLIP metrics while running at interactive speeds. The approach makes text-to-image generation faster and more scalable, which could benefit real-time creative tools and broader applications, though output quality depends on the teacher and ethical use remains a consideration.
Abstract
We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs from the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss that operates directly in the diffusion model's latent space and uses an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss, yielding an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models -- DMD, SDXL-Turbo, and SDXL-Lightning -- on the zero-shot COCO benchmark.
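To make the idea of a regression loss with ensembled augmentations more concrete, here is a minimal toy sketch in plain Python. It is not the authors' E-LatentLPIPS: the real loss relies on a learned perceptual network operating on latents and on differentiable augmentations, whereas `toy_features`, `hflip`, `roll`, and `ensembled_loss` below are hypothetical stand-ins chosen only for illustration. The one point it tries to show is that each randomly drawn augmentation is applied identically to the generated latent and to the teacher's target latent before the feature distance is measured, and that the distances are averaged over several draws.

```python
import random


def hflip(z):
    """Mirror a 2-D latent (a list of rows) left to right."""
    return [row[::-1] for row in z]


def roll(z, dx):
    """Cyclically shift every row of the latent by dx positions."""
    return [row[-dx:] + row[:-dx] for row in z]


def toy_features(z):
    """Hypothetical stand-in for a learned latent feature extractor:
    raw values plus horizontal neighbor differences, flattened."""
    feats = []
    for row in z:
        feats.extend(row)
        feats.extend(b - a for a, b in zip(row, row[1:]))
    return feats


def distance(a, b):
    """Mean squared difference between the toy features of two latents."""
    fa, fb = toy_features(a), toy_features(b)
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)


def ensembled_loss(generated, target, n_aug=4, seed=0):
    """Average the feature distance over n_aug random augmentations.

    Each draw picks one flip/shift setting and applies it to BOTH latents,
    so the pair stays aligned while the loss sees varied views.
    """
    rng = random.Random(seed)
    width = len(target[0])
    total = 0.0
    for _ in range(n_aug):
        flip = rng.random() < 0.5
        dx = rng.randrange(width)

        def augment(z):
            z = hflip(z) if flip else z
            return roll(z, dx)

        total += distance(augment(generated), augment(target))
    return total / n_aug


# Toy 2x3 "latents": target from the teacher, generated from the student.
target = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
generated = [[0.5, 1.0, 2.0], [3.0, 4.0, 4.0]]
print(ensembled_loss(generated, target))
```

In a real training loop the student generator would map the stored noise (and text prompt) to `generated`, and this loss term would be combined with the conditional GAN loss described above; the weighting between the two is not specified in this section.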
