Guiding Diffusion Models with Semantically Degraded Conditions

Shilong Han, Yuming Zhang, Hongxia Wang

TL;DR

This paper proposes Condition-Degradation Guidance (CDG), which replaces the null prompt used in Classifier-Free Guidance with a strategically degraded condition, $\boldsymbol{c}_{\text{deg}}$, and markedly improves compositional accuracy and text-image alignment.

Abstract

Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt ($\varnothing$) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, $\boldsymbol{c}_{\text{deg}}$. This reframes guidance from a coarse "good vs. null" contrast to a more refined "good vs. almost good" discrimination, thereby compelling the model to capture fine-grained semantic distinctions. We find that tokens in transformer text encoders split into two functional roles: content tokens encoding object semantics, and context-aggregating tokens capturing global context. By selectively degrading only the former, CDG constructs $\boldsymbol{c}_{\text{deg}}$ without external models or training. Validated across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically-aware negative samples is critical to achieving precise semantic control. Code is available at https://github.com/Ming-321/Classifier-Degradation-Guidance.
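
To make the "good vs. null" versus "good vs. almost good" contrast concrete, the sketch below writes out the standard CFG combination, which extrapolates from a reference prediction toward the conditional one, and then swaps the reference. It assumes that CDG keeps the same extrapolation form and only replaces the null-prompt prediction with the prediction under $\boldsymbol{c}_{\text{deg}}$; the abstract states the replacement but not the exact formula. The function name `guided_prediction` and the random arrays standing in for model outputs are hypothetical.

```python
import numpy as np

def guided_prediction(pred_cond, pred_ref, scale):
    """Extrapolate from a reference prediction toward the conditional one."""
    return pred_ref + scale * (pred_cond - pred_ref)

rng = np.random.default_rng(0)
shape = (4, 8, 8)

# Random stand-ins for denoiser outputs (assumptions, not real model calls).
pred_cond = rng.normal(size=shape)                    # model(x_t, c)
pred_null = rng.normal(size=shape)                    # model(x_t, null prompt)
pred_deg = pred_cond + 0.1 * rng.normal(size=shape)   # model(x_t, c_deg)

# CFG: the reference is the null-prompt prediction ("good vs. null").
cfg_out = guided_prediction(pred_cond, pred_null, scale=5.0)

# CDG (assumed form): the reference is the degraded-condition prediction
# ("good vs. almost good").
cdg_out = guided_prediction(pred_cond, pred_deg, scale=5.0)
```

With `scale=1.0` both variants reduce to the plain conditional prediction, so the reference only matters once the guidance scale moves away from 1.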

Paper Structure (52 sections, 13 equations, 16 figures, 9 tables, 2 algorithms)

Figures (16)

  • Figure 1: Qualitative comparison between Classifier-Free Guidance (CFG) and our Condition-Degradation Guidance (CDG) across three state-of-the-art models (SD3, SD3.5, and Flux). These examples demonstrate CDG's superior capability in handling complex compositional prompts where CFG often fails. CDG consistently outperforms CFG in accurate text rendering, precise spatial relationships and attribute binding, as well as complex object interactions.
  • Figure 2: CDG synthesizes a geometrically superior guidance signal compared to CFG. (Top) Geometric Decoupling: CDG maintains near-perfect orthogonality throughout generation, while CFG suffers from significant early-stage entanglement. (Bottom) Interference Energy Ratio: CDG exhibits minimal interference, in stark contrast to CFG's substantial energy waste in misaligned directions. Together, these analyses demonstrate that CDG's guidance signal is structurally cleaner and more efficient from its inception, explaining its enhanced compositional control.
  • Figure 3: An illustration of our proposed pipeline for constructing the semantically degraded condition, $\boldsymbol{c}_{\text{deg}}$. The process begins with attention graph extraction (a--c), where the self-attention map (b) from a transformer block (a) is modeled as a graph (c). Next, the Weighted PageRank (WPR) algorithm is applied to compute an importance score for each token (d). Following our Stratified Degradation strategy, these scores are used to generate a binary mask $\boldsymbol{m}$ (e). Finally, the mask facilitates the construction of $\boldsymbol{c}_{\text{deg}}$ via masked interpolation (f) between the original condition $\boldsymbol{c}$ and the null condition $\emptyset$.
  • Figure 4: WPR reveals a clear importance dichotomy between content and context-aggregating tokens: content tokens carry fine-grained semantics while context-aggregating tokens carry coarse-grained semantics, as exemplified by the prompt "A man is cooking, MineCraft Style." (a) The stem plot shows that high importance scores (red for CLIP, cyan for T5) are almost exclusively concentrated on semantic content tokens. (b) The ranked list confirms that the top tokens ("minecraft", "cooking", "man") are almost all content-related. This dichotomy motivates our Stratified Degradation strategy, which first degrades content tokens and then context-aggregating tokens for controllable semantic degradation.
  • Figure 5: Hyperparameter analysis: joint effect of intervention block ($\lambda_{\text{block}}$) and Degradation Ratio ($R_{\text{deg}}$) on SD3.
  • ...and 11 more figures
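
The pipeline described in the Figure 3 caption (attention graph, token importance scores, binary mask, masked interpolation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plain power-iteration PageRank with attention weights as edge weights in place of the paper's Weighted PageRank variant, it masks the top-scoring fraction of tokens as a stand-in for the Stratified Degradation strategy, and all function names, the damping factor, and the toy inputs are hypothetical.

```python
import numpy as np

def pagerank_scores(attn, damping=0.85, iters=100):
    """Power-iteration PageRank over a token attention matrix.

    attn[i, j] is read as an edge of weight attn[i, j] from token i to
    token j, so tokens that receive a lot of attention score highly.
    Rows are normalized here to form a transition matrix.
    """
    n = attn.shape[0]
    transition = attn / attn.sum(axis=1, keepdims=True)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1.0 - damping) / n + damping * (transition.T @ scores)
    return scores / scores.sum()

def degradation_mask(scores, ratio):
    """Binary mask m: 1 for the top `ratio` fraction of tokens by score."""
    k = int(round(ratio * scores.size))
    mask = np.zeros(scores.size)
    if k > 0:
        mask[np.argsort(scores)[::-1][:k]] = 1.0
    return mask

def degrade_condition(cond, null_cond, mask):
    """Masked interpolation: masked tokens take the null embedding."""
    m = mask[:, None]
    return m * null_cond + (1.0 - m) * cond

# Toy inputs (assumptions): 6 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
num_tokens, dim = 6, 4
attn = rng.random((num_tokens, num_tokens))
attn[:, 2] += 3.0                      # token 2 receives the most attention
cond = rng.normal(size=(num_tokens, dim))
null_cond = np.zeros((num_tokens, dim))

scores = pagerank_scores(attn)
mask = degradation_mask(scores, ratio=0.5)
c_deg = degrade_condition(cond, null_cond, mask)
```

Here `ratio` plays the role of the Degradation Ratio $R_{\text{deg}}$ mentioned in Figure 5: a ratio of 0 leaves the condition untouched, and a ratio of 1 replaces every token with the null embedding.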