DICE: Distilling Classifier-Free Guidance into Text Embeddings

Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen, Siwei Lyu

TL;DR

This work tackles the computational burden of Classifier-Free Guidance in text-to-image diffusion by introducing DICE, a lightweight embedding sharpener trained under CFG supervision. By decoupling the sharpener from the diffusion model, DICE achieves CFG-like image quality with unguided sampling at roughly half the computational cost. The method emphasizes sharpening padding components of text embeddings while preserving semantic content, and demonstrates strong generalization across SD15, SDXL, and PixArt-$\alpha$, including handling of negative prompts. Together, these results indicate a practical, scalable path to high-fidelity, efficiently guided image generation in diverse diffusion frameworks.

Abstract

Text-to-image diffusion models are capable of generating high-quality images, but suboptimal pre-trained text representations often result in these images failing to align closely with the given text prompts. Classifier-free guidance (CFG) is a popular and effective technique for improving text-image alignment in the generative process. However, CFG introduces significant computational overhead. In this paper, we present DIstilling CFG by sharpening text Embeddings (DICE), which replaces CFG in the sampling process with half the computational complexity while maintaining similar generation quality. DICE distills a CFG-based text-to-image diffusion model into a CFG-free version by refining text embeddings to replicate CFG-based directions. In this way, we avoid the computational drawbacks of CFG, enabling high-quality, well-aligned image generation at a fast sampling speed. Furthermore, examining the enhancement pattern, we identify the underlying mechanism of DICE that sharpens specific components of text embeddings to preserve semantic information while enhancing fine-grained details. Extensive experiments on multiple Stable Diffusion v1.5 variants, SDXL, and PixArt-$\alpha$ demonstrate the effectiveness of our method. Code is available at https://github.com/zju-pi/dice.
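
To make the "half the computational complexity" claim concrete, the following is a minimal sketch, not the authors' implementation. It uses one common parametrization of CFG, `eps_u + w * (eps_c - eps_u)`, and compares the number of network evaluations per sampling step against a single pass with a modified embedding. `ToyDenoiser`, `sharpen`, and the additive `delta` are hypothetical stand-ins introduced for illustration only; the actual sharpener is a trained network whose architecture is not described in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))  # toy weights mapping an 8-dim embedding to a 4-dim latent


class ToyDenoiser:
    """Hypothetical stand-in for a diffusion network; counts forward passes (NFE)."""

    def __init__(self):
        self.nfe = 0

    def __call__(self, x, emb):
        self.nfe += 1
        return np.tanh(x + emb @ W)


def cfg_step(model, x, cond_emb, null_emb, w):
    """Guided prediction: two forward passes per sampling step."""
    eps_c = model(x, cond_emb)   # conditional prediction
    eps_u = model(x, null_emb)   # unconditional (or negative-prompt) prediction
    return eps_u + w * (eps_c - eps_u)


def sharpen(cond_emb, delta):
    """Hypothetical sharpener: here just a fixed additive correction (an assumption)."""
    return cond_emb + delta


def dice_like_step(model, x, cond_emb, delta):
    """Unguided prediction with a sharpened embedding: one forward pass per step."""
    return model(x, sharpen(cond_emb, delta))


if __name__ == "__main__":
    x, cond, null = np.zeros(4), np.ones(8), np.zeros(8)
    guided, single = ToyDenoiser(), ToyDenoiser()
    cfg_step(guided, x, cond, null, w=7.5)
    dice_like_step(single, x, cond, delta=np.zeros(8))
    print("NFE per step:", guided.nfe, "(CFG) vs", single.nfe, "(single pass)")
```

Because the sharpened embedding can be computed once per prompt, the per-step cost in this sketch is that of unguided sampling, which is where the factor of two comes from.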

Paper Structure

This paper contains 21 sections, 7 equations, 22 figures, 6 tables, 1 algorithm.

Figures (22)

  • Figure 1: Comparison of text-to-image generation: unguided sampling, guided sampling, and DICE. Top: Average aesthetic score [schuhmann2022laion] over 5,000 images plotted against the number of function evaluations (NFE). Bottom: An example of image synthesis using different methods at NFE = 4, 8, 12, and 16.
  • Figure 2: Overview of DICE sampling and comparison with traditional unguided and guided sampling. With sharpened text embeddings, DICE achieves high-quality image generation comparable to guided sampling while maintaining the same computational overhead as unguided sampling.
  • Figure 3: Text-image alignment with scaled text embeddings. Images are generated by DreamShaper [DreamShaper], a popular variant of Stable Diffusion v1.5 [rombach2022ldm] with a CLIP text encoder [radford2021learning], and PixArt-$\alpha$ [chen2024pixart] with a T5-XXL text encoder [raffel2020exploring]. Left: Text embeddings are scaled by a factor $s$ and images are generated via unguided and guided sampling. Right: A grid search is conducted to identify the optimal scaling factor with respect to the CLIP score (CS) and Aesthetic score (AS). An optimal scaling factor improves sample quality but varies across models. Moreover, naive scaling is insufficient to bring unguided sampling up to the image quality achieved by guided sampling, which motivates exploring the embedding space for a fine-grained, dynamic scaling. Prompt: "An epic landscape".
  • Figure 4: Top: a text embedding consists of a semantic embedding and a padding embedding. Bottom: replacing the original text embedding with the sharpened semantic embedding or the sharpened padding embedding. The latter substantially improves sample quality.
  • Figure 5: Qualitative results with different model capacities, image styles and network architectures. Images are generated by 20-step DPM-Solver++ [lu2022dpmpp] on 7 text-to-image models including multiple SD15 variants [rombach2022ldm], SDXL [podell2024sdxl] and PixArt-$\alpha$ [chen2024pixart]. The prompts used are provided in Section \ref{sec:app_details}.
  • ...and 17 more figures
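
The embedding manipulations described in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from the paper: it assumes a text embedding is an array of shape `(num_tokens, dim)` whose first `n_semantic` rows correspond to the prompt tokens and whose remaining rows correspond to padding tokens. The function names and the example shape `(77, 768)` are hypothetical choices for illustration.

```python
import numpy as np


def scale_embedding(emb, s):
    """Naive global scaling of a text embedding by a factor s (as probed in Figure 3)."""
    return s * emb


def split_embedding(emb, n_semantic):
    """Split into semantic rows (prompt tokens) and padding rows (assumed layout)."""
    return emb[:n_semantic], emb[n_semantic:]


def replace_padding(original, sharpened, n_semantic):
    """Keep the original semantic rows and take the padding rows from `sharpened`."""
    out = original.copy()
    out[n_semantic:] = sharpened[n_semantic:]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.standard_normal((77, 768))        # assumed (num_tokens, dim) layout
    sharpened = emb + 0.1 * rng.standard_normal(emb.shape)  # placeholder for a sharpener output
    semantic, padding = split_embedding(emb, n_semantic=6)
    mixed = replace_padding(emb, sharpened, n_semantic=6)
    print(semantic.shape, padding.shape, mixed.shape)
```

A single scalar `s` treats every token identically, whereas swapping only the padding rows changes them independently of the semantic rows, which is the kind of fine-grained adjustment the Figure 3 caption argues is needed.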