Guided Score identity Distillation for Data-Free One-Step Text-to-Image Generation
Mingyuan Zhou, Zhendong Wang, Huangjie Zheng, Hai Huang
TL;DR
This work addresses slow sampling in diffusion-based text-to-image generation when the original training data is unavailable. It introduces Guided Score identity Distillation with Long-Short classifier-free Guidance (SiD-LSG), a data-free method that distills pretrained Stable Diffusion models into one-step generators using a model-based explicit score-matching loss and CFG applied to both the teacher and fake-score networks. On SD1.5 and SD2.1-base, SiD-LSG achieves state-of-the-art data-free one-step FID on the COCO-2014 validation set (8.15 for SD1.5) with competitive CLIP scores and favorable human preference metrics, while offering a flexible CFG design via long, short, and long-short strategies. The approach reduces the computational cost of deployment while preserving alignment with text prompts; future work will explore real-data integration and diffusion-GAN adversarial training to push performance further.
Abstract
Diffusion-based text-to-image generation models trained on extensive text-image pairs have demonstrated the ability to produce photorealistic images aligned with textual descriptions. However, a significant limitation of these models is their slow sample generation process, which requires iterative refinement through the same network. To overcome this, we introduce a data-free guided distillation method that enables the efficient distillation of pretrained Stable Diffusion models without access to the real training data, which is often restricted due to legal, privacy, or cost concerns. This method enhances Score identity Distillation (SiD) with Long and Short Classifier-Free Guidance (LSG), an innovative strategy that applies Classifier-Free Guidance (CFG) not only to the evaluation of the pretrained diffusion model but also to the training and evaluation of the fake score network. We optimize a model-based explicit score matching loss using a score-identity-based approximation alongside our proposed guidance strategies for practical computation. By training exclusively on synthetic images generated by its one-step generator, our data-free distillation method rapidly improves FID and CLIP scores, achieving state-of-the-art FID performance while maintaining a competitive CLIP score. Notably, the one-step distillation of Stable Diffusion 1.5 achieves an FID of 8.15 on the COCO-2014 validation set, a record low value under the data-free setting. Our code and checkpoints are available at https://github.com/mingyuanzhou/SiD-LSG.
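
The abstract describes applying CFG to the outputs of both the pretrained diffusion model and the fake score network. As a minimal sketch of the standard CFG combination only, and not the authors' implementation, the following assumes network outputs are plain lists of floats. The names `cfg_combine`, `teacher_scale`, and `fake_scale`, and all numeric values, are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 1 returns the conditional prediction unchanged,
    scale = 0 returns the unconditional prediction,
    scale > 1 extrapolates away from the unconditional prediction.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy stand-ins for network outputs (hypothetical values, not real model predictions).
teacher_cond, teacher_uncond = [1.0, 2.0], [0.0, 1.0]
fake_cond, fake_uncond = [0.5, 1.5], [0.5, 0.5]

# Hypothetical guidance scales; the actual long, short, and long-short
# settings are defined in the paper and are not reproduced here.
teacher_scale = 2.0
fake_scale = 2.0

# The same combination rule is applied to each network's pair of predictions.
guided_teacher = cfg_combine(teacher_cond, teacher_uncond, teacher_scale)
guided_fake = cfg_combine(fake_cond, fake_uncond, fake_scale)
```

Here the two networks share one combination rule and differ only in the scale passed in, which is one simple way to read "CFG applied to both networks". How the scales are chosen for training versus evaluation is what distinguishes the long, short, and long-short strategies named above.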
