Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, Yu Liu

TL;DR

This work identifies spatial inconsistencies in global classifier-free guidance (CFG) for text-to-image diffusion and introduces Semantic-aware CFG (S-CFG). S-CFG builds training-free semantic maps from cross- and self-attention within the U-net to partition latent space into region masks, then applies adaptive, region-specific CFG scales to equalize guidance across semantic units. The method yields improved image fidelity and text–image alignment across multiple diffusion models and samplers, without extra training cost, and enhances downstream tasks like ControlNet and DreamBooth. Overall, S-CFG provides a robust, region-aware alternative to global CFG that improves generation quality and consistency in practice.

Abstract

Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance over the whole image space. However, we argue that a global CFG scale results in spatially inconsistent semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized to assign each patch to its corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees to a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. Our code is available at https://github.com/SmilesDZgk/S-CFG.
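
The contrast between a global CFG scale and region-specific scales can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array shapes, and in particular the choice of the whole-latent mean norm as the common reference level are assumptions made here for illustration, and the abstract does not state the exact rescaling rule. The difference `eps_cond - eps_uncond` stands in for the classifier score up to a step-dependent factor.

```python
import numpy as np


def cfg_global(eps_uncond, eps_cond, scale):
    """Standard CFG: a single scalar scale applied over the whole latent."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def cfg_region_scales(eps_uncond, eps_cond, masks, base_scale, tiny=1e-8):
    """Build a per-position scale map that equalizes guidance norm across regions.

    eps_uncond, eps_cond: arrays of shape (C, H, W).
    masks: boolean array of shape (R, H, W); regions are assumed non-overlapping.
    The reference level (mean norm over the whole latent) is an assumption.
    """
    diff = eps_cond - eps_uncond
    norm = np.linalg.norm(diff, axis=0)          # guidance norm per position, (H, W)
    ref = norm.mean()                            # assumed common target level
    scale_map = np.full(norm.shape, float(base_scale))
    for m in masks:
        if m.any():
            region_norm = norm[m].mean()         # average guidance norm in this region
            scale_map[m] = base_scale * ref / (region_norm + tiny)
    return scale_map


def cfg_semantic(eps_uncond, eps_cond, masks, base_scale):
    """Apply guidance with the region-wise scale map instead of one scalar."""
    scale_map = cfg_region_scales(eps_uncond, eps_cond, masks, base_scale)
    return eps_uncond + scale_map[None] * (eps_cond - eps_uncond)
```

Under this sketch, regions where the text guidance is weak receive a scale above `base_scale` and regions where it is strong receive a smaller one, so the scaled guidance term has the same average norm in every region.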

Paper Structure (29 sections, 16 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: A motivation example. The top row shows images generated by Stable Diffusion with CFG and S-CFG, where the prompt is "a photo of an astronaut riding a horse" and the segmentation maps are manually labeled (Ground, Sky, Horse, Astronaut). The bottom row shows the average norm curves of the estimated classifier score $\nabla_{x_t} \log p(c|x_t)$ (solid line) and diffusion score $\nabla_{x_t} \log p(x_t)$ (dashed line) in each semantic region. The Y-axis is expressed in units of the dynamic variance parameter $\sigma_t$ for clearer illustration; this does not affect the conclusion.
  • Figure 2: The overall framework of our S-CFG method. At each denoising step in diffusion models, the U-net backbone estimates both the diffusion score $\nabla_{x_t} \log p(x_t)$ (without the text prompt) and the conditional diffusion score $\nabla_{x_t} \log p(x_t|c)$ (with the text prompt), from which the classifier score $\nabla_{x_t} \log p(c|x_t)$ can be inferred. By extracting and exploiting the self-attention map $S^k_t$ and cross-attention map $C^k_t$ in each attention layer of the U-net, we obtain the region mask $m_{t,i}$ for each prompt token $i$. To unify the classifier score norm across regions, a CFG scale map is then determined to control the semantic strengths spatially in the following step.
  • Figure 3: The latent image segmentation based on attention maps at different denoising steps. The first column shows the predicted image $x_0$ based on the current latent image $x_t$ and noise estimation $\epsilon_{\theta}$ with Equation \ref{equ:xtx0}. The next three columns show the semantic segmentation maps obtained with different strategies. Regions labeled by different colors correspond to different tokens. The last column shows the foreground mask detected by our approach.
  • Figure 4: The quantitative evaluation results on the trade-off curve of FID-30K vs. CLIP Score.
  • Figure 5: Samples generated by different base models with CFG (left) or S-CFG (right).
  • ...and 6 more figures
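
The two operations mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is a hedged illustration, not the paper's procedure: `predict_x0` assumes the standard DDPM parameterization $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ (the referenced equation itself is not reproduced in this excerpt), and `segment_latent` assumes a simple per-token renormalization of the cross-attention map followed by self-attention propagation. All function names, shapes, and the `steps` parameter are hypothetical.

```python
import numpy as np


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean latent x_0 from x_t and the predicted noise.

    Assumes x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


def segment_latent(cross_attn, self_attn, steps=1, tiny=1e-8):
    """Assign each latent patch to one prompt token.

    cross_attn: (N, T) attention from N latent patches to T prompt tokens.
    self_attn:  (N, N) attention among patches, rows summing to 1.
    Returns an integer array of shape (N,) with a token index per patch.
    """
    # Renormalize each token's map over patches so that tokens with globally
    # weak attention can still claim the patches where they are relatively strong.
    per_token = cross_attn / (cross_attn.sum(axis=0, keepdims=True) + tiny)
    # Propagate scores through self-attention so that patches attending to
    # each other tend to receive the same label (region completion).
    refined = per_token
    for _ in range(steps):
        refined = self_attn @ refined
    return refined.argmax(axis=1)


def region_masks(labels, num_tokens):
    """Turn per-patch labels into boolean masks of shape (T, N)."""
    return np.stack([labels == i for i in range(num_tokens)])
```

The per-token renormalization matters because a plain argmax over the raw cross-attention rows would assign every patch to whichever token dominates globally; dividing by each token's total attention first lets less dominant tokens keep their own regions.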