
Guided Score identity Distillation for Data-Free One-Step Text-to-Image Generation

Mingyuan Zhou, Zhendong Wang, Huangjie Zheng, Hai Huang

TL;DR

This work addresses slow sampling in diffusion-based text-to-image generation under data constraints by introducing data-free Guided Score identity Distillation with Long-Short classifier-free Guidance (SiD-LSG). It distills pretrained Stable Diffusion models into one-step generators without real data, using a model-based explicit score-matching loss and CFG applied to both the teacher and fake-score networks. On SD1.5 and SD2.1-base, SiD-LSG achieves state-of-the-art data-free one-step FID on COCO-2014 (e.g., 8.15 for SD1.5) with competitive CLIP scores and favorable human preference metrics, while offering a flexible CFG design via long, short, and long-short strategies. The approach reduces computational costs for deployment and preserves alignment with textual prompts. Future work will explore real-data integration and diffusion-GAN adversarial training to push performance further.

Abstract

Diffusion-based text-to-image generation models trained on extensive text-image pairs have demonstrated the ability to produce photorealistic images aligned with textual descriptions. However, a significant limitation of these models is their slow sample generation process, which requires iterative refinement through the same network. To overcome this, we introduce a data-free guided distillation method that enables the efficient distillation of pretrained Stable Diffusion models without access to the real training data, often restricted due to legal, privacy, or cost concerns. This method enhances Score identity Distillation (SiD) with Long and Short Classifier-Free Guidance (LSG), an innovative strategy that applies Classifier-Free Guidance (CFG) not only to the evaluation of the pretrained diffusion model but also to the training and evaluation of the fake score network. We optimize a model-based explicit score matching loss using a score-identity-based approximation alongside our proposed guidance strategies for practical computation. By exclusively training with synthetic images generated by its one-step generator, our data-free distillation method rapidly improves FID and CLIP scores, achieving state-of-the-art FID performance while maintaining a competitive CLIP score. Notably, the one-step distillation of Stable Diffusion 1.5 achieves an FID of 8.15 on the COCO-2014 validation set, a record low value under the data-free setting. Our code and checkpoints are available at https://github.com/mingyuanzhou/SiD-LSG.
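
The abstract refers to applying Classifier-Free Guidance (CFG) to both the pretrained diffusion model and the fake score network. As a rough illustration only, the sketch below shows the generic CFG combination rule, in which a conditional prediction is extrapolated away from an unconditional one by a scale `kappa`. The function name `cfg_combine`, the array shapes, and the random inputs are hypothetical stand-ins; how SiD-LSG assigns its guidance scales inside the distillation loss is not reproduced here.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, kappa):
    """Generic classifier-free guidance (not the paper's full loss).

    kappa = 0 returns the unconditional prediction, kappa = 1 returns the
    plain text-conditional prediction, and kappa > 1 strengthens guidance.
    """
    return eps_uncond + kappa * (eps_cond - eps_uncond)


# Hypothetical stand-ins for the two network outputs on a batch of latents.
rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))
eps_cond = rng.standard_normal((4, 64, 64))

no_guidance = cfg_combine(eps_uncond, eps_cond, 1.0)  # equals eps_cond
guided = cfg_combine(eps_uncond, eps_cond, 1.5)       # extrapolated prediction
```

With `kappa = 1.5`, the guided output lies 1.5 times as far from the unconditional prediction as the conditional one does, which is why larger scales tend to trade diversity for closer text alignment.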

Paper Structure

This paper contains 16 sections, 12 equations, 36 figures, 4 tables, 1 algorithm.

Figures (36)

  • Figure 2: Rapid advancements in distilling Stable Diffusion 1.5 are showcased by the proposed SiD method that incorporates long-short guidance (LSG). Key parameters include a batch size of 512, a learning rate of 1e-6, and an LSG scale of 2. This data-free approach achieves a zero-shot FID of 9.56 on the COCO-2014 validation set, along with a competitive CLIP score of 0.313. By reducing the LSG scale to 1.5, the FID can be further lowered to a record 8.15 among data-free diffusion distillation models, with a corresponding CLIP score of 0.304. The series of images, generated from the same set of random noises after training the SiD generator with varying counts of synthesized images, illustrates progressions at 0, 0.02, 0.1, 0.2, 0.5, 1, 2, 3, 4, and 5 million images. These are equivalent to 0, 40, 200, 400, 1k, 2k, 4k, 6k, 8k, and 10k training iterations respectively, organized from the top left to the bottom right. The progression of FIDs and CLIPs is detailed in the orange solid curves in the left plot of Fig. \ref{fig:lsg}. The corresponding COCO-2014 validation text prompts are listed in Appendix \ref{sec:prompts}.
  • Figure 3: Left (Long CFG of the true score network): This plot illustrates the gradual decline in FID and the corresponding rise in CLIP scores, each influenced by different CFGs applied to the true score network. $\kappa$ values not specified in the legend are set to 1. FID scores are plotted on the primary y-axis, while CLIP scores are displayed on the secondary y-axis in corresponding line styles but with slight transparency. Together, these lines demonstrate how various CFGs impact model performance. Right (No CFG; Short CFG of the fake score network with $\kappa_2=\kappa_3\in(0,1)$; a simple form of LSG that sets $\kappa_1>1$): Analogous plot to the left where the CFGs of the fake score network are not applied, weakened during evaluation, or enhanced during training.
  • Figure 5: This figure illustrates the progression of FID and CLIP scores during an ablation study of distilling SD1.5 using SiD-LSG. The default settings of batch size 512, learning rate 1e-6, LSG scale 2, and Prompt Aesthetics6+ are maintained unless specified otherwise. Left: The number of training fake images is doubled from 10M to 20M under LSG scales of 1.5 and 2.0. Middle: Variations in batch size and learning rate settings under LSG 1.5. Right: Comparison of training prompts Aesthetics6+, Aesthetics6.25+, and Aesthetics6.5+.
  • Figure 6: Qualitative comparison of one-step distillation methods using identical text prompts and random seeds.
  • Figure 7: Visual comparison of two SiD-LSG models: one preferred for FID and the other for CLIP. All images are generated from the same text prompt: "A distinguished older gentleman in a vintage study, surrounded by books and dim lighting, his face marked by wisdom and time. 8K, hyper-realistic, cinematic, post-production." The model with a lower guidance scale of $\kappa=1.5$, which achieves a record-low one-step-generation FID of 8.15 and a competitive CLIP score of 0.304, produces images that are more diverse but align less closely with specific text details, such as "dim lighting." Conversely, the model with a higher guidance scale of $\kappa=4.5$, achieving a high CLIP score of 0.322 and noted for state-of-the-art human preference scores (HPSv2) as shown in Table 2, presents a relatively high FID of 16.54, indicating less diversity but superior text alignment and visual quality.
  • ...and 31 more figures
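
The Figure 2 caption pairs counts of synthesized images with training iterations at a batch size of 512. A minimal sketch of that arithmetic follows, assuming each iteration consumes exactly one batch of generated images; the helper name `images_to_iterations` is hypothetical. Under this assumption the computed values land within a few percent of the rounded figures quoted in the caption.

```python
def images_to_iterations(num_images, batch_size=512):
    """Iterations needed to consume num_images, assuming one batch per iteration."""
    return num_images / batch_size


# Image counts from the Figure 2 caption (0 to 5 million synthesized images).
image_counts = [0, 20_000, 100_000, 200_000, 500_000,
                1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]

# Rounded iteration counts as quoted in the caption.
caption_iterations = [0, 40, 200, 400, 1_000, 2_000, 4_000, 6_000, 8_000, 10_000]

iterations = [round(images_to_iterations(n)) for n in image_counts]

for n, it, quoted in zip(image_counts, iterations, caption_iterations):
    print(f"{n:>9,d} images -> {it:>5d} iterations (caption: {quoted})")
```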