Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance

Badr Moufad, Yazid Janati, Alain Durmus, Ahmed Ghorbel, Eric Moulines, Jimmy Olsson

TL;DR

This paper identifies a fundamental limitation of classifier-free guidance (CFG): the linear CFG denoiser does not correspond to a valid denoising diffusion model for the tilted target distribution proportional to $p_0(x|c)^w p_0(x)^{1-w}$. It introduces the gradient of a Rényi divergence as a repulsive term that restores consistency with a proper diffusion process, and it proposes Classifier-Free Gibbs-Like Guidance (CFGiG), a Gibbs-like sampling procedure that starts from a sample of the conditional model and iteratively refines it by alternating noising and CFG-based denoising, preserving diversity while improving quality. The authors provide a theoretical analysis in a Gaussian setting, derive a two-noise-level expression for the tilted score, and demonstrate substantial gains over CFG on both image and text-to-audio generation tasks. The work also suggests training-time objectives that explicitly account for the Rényi term, which could reduce tuning and enable guidance across the whole sampling range. Overall, CFGiG presents a principled path to improving perceptual quality while preserving sample diversity in conditional diffusion generation.
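To make the tilted target concrete, the following worked equations sketch why the limiting CFG score points to a distribution proportional to $p_0(x|c)^w p_0(x)^{1-w}$. The notation is an assumption made for illustration and may differ from the paper's: $D_\sigma$ denotes a denoiser at noise level $\sigma$ under the noising $X_\sigma = X_0 + \sigma \varepsilon$, $p_\sigma$ the corresponding noised marginal, and Tweedie's formula $D_\sigma(x) = x + \sigma^2 \nabla_x \log p_\sigma(x)$ links the two.

```latex
\begin{align*}
D^{\mathrm{cfg}}_\sigma(x \mid c; w)
  &= w\, D_\sigma(x \mid c) + (1 - w)\, D_\sigma(x), \\
s^{\mathrm{cfg}}_\sigma(x \mid c; w)
  &= \frac{D^{\mathrm{cfg}}_\sigma(x \mid c; w) - x}{\sigma^2}
   = w\, \nabla_x \log p_\sigma(x \mid c) + (1 - w)\, \nabla_x \log p_\sigma(x), \\
\lim_{\sigma \to 0} s^{\mathrm{cfg}}_\sigma(x \mid c; w)
  &= \nabla_x \log \left[ p_0(x \mid c)^{w}\, p_0(x)^{1 - w} \right]
  \quad \text{(formally, under suitable regularity).}
\end{align*}
```

For $\sigma > 0$, the product $p_\sigma(x \mid c)^{w}\, p_\sigma(x)^{1-w}$ of noised marginals is in general not the noised version of the tilted target, so $s^{\mathrm{cfg}}_\sigma$ is not the score of a diffusion path ending at that target. This is the gap that the Rényi divergence term is said to close.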

Abstract

Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero -- where the data distribution is tilted by a power $w > 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at https://github.com/yazidjanati/cfgig
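The procedure described above (initialise from the conditional model without CFG, then repeatedly re-noise and denoise with the CFG denoiser) can be sketched in a one-dimensional Gaussian toy problem where both denoisers are available in closed form. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the geometric noise schedule, the deterministic DDIM-style update, and all default parameter values are hypothetical choices made for this example.

```python
import math
import random


def gaussian_denoiser(x, sigma, mean, std):
    """Exact posterior mean E[X_0 | X_sigma = x] when X_0 ~ N(mean, std^2)
    and X_sigma = X_0 + sigma * eps (toy stand-in for a trained network)."""
    return (sigma**2 * mean + std**2 * x) / (std**2 + sigma**2)


def cfg_denoiser(x, sigma, w, cond, uncond):
    """Linear CFG combination: w * conditional + (1 - w) * unconditional."""
    d_cond = gaussian_denoiser(x, sigma, *cond)
    d_uncond = gaussian_denoiser(x, sigma, *uncond)
    return w * d_cond + (1.0 - w) * d_uncond


def noise_schedule(sigma_start, n_steps, sigma_min=1e-3):
    """Geometric schedule from sigma_start to sigma_min, then 0 (an assumption)."""
    ratio = (sigma_min / sigma_start) ** (1.0 / (n_steps - 1))
    return [sigma_start * ratio**k for k in range(n_steps)] + [0.0]


def ddim_denoise(x, sigmas, denoiser):
    """Deterministic DDIM-style update: move x toward the denoised estimate
    so that the residual noise shrinks from s_cur to s_next."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = denoiser(x, s_cur)
        x = d + (s_next / s_cur) * (x - d)
    return x


def gibbs_like_sample(w, cond, uncond, sigma_max=20.0, sigma_star=1.0,
                      n_rounds=3, n_steps=50, rng=random):
    """Hypothetical Gibbs-like loop: conditional init, then noise/denoise rounds."""
    # Initial sample from the conditional model, without guidance.
    x = rng.gauss(0.0, sigma_max)
    x = ddim_denoise(x, noise_schedule(sigma_max, n_steps),
                     lambda y, s: gaussian_denoiser(y, s, *cond))
    refine_sigmas = noise_schedule(sigma_star, n_steps)
    for _ in range(n_rounds):
        # Noising step: perturb the current sample up to level sigma_star.
        x_noisy = x + sigma_star * rng.gauss(0.0, 1.0)
        # Denoising step: go back to noise level 0 with the CFG denoiser.
        x = ddim_denoise(x_noisy, refine_sigmas,
                         lambda y, s: cfg_denoiser(y, s, w, cond, uncond))
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    cond, uncond = (1.0, 0.5), (0.0, 2.0)  # (mean, std) of the toy distributions
    samples = [gibbs_like_sample(2.0, cond, uncond, rng=rng) for _ in range(5)]
    print(samples)
```

In this toy setting each refinement round is an affine map of the re-noised sample, so the intermediate noise level `sigma_star` and the number of rounds control how far the samples move from the conditional initialisation. How these quantities should be chosen for real models is a question for the paper's experiments, not for this sketch.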

Paper Structure

This paper contains 42 sections, 12 theorems, 71 equations, 25 figures, 7 tables, 2 algorithms.

Key Result

Proposition 1

For any $\sigma > 0$, the scores associated with $p_\sigma(\cdot \mid \mathbf{c}; w)$ are

Figures (25)

  • Figure 1: Illustration of sample refinement across Gibbs iterations. Left, samples generated using EDM-XXL for two ImageNet classes: $291$ (top) and $967$ (bottom). Right, samples generated using Stable Diffusion XL (SDXL) for the prompts: "A black bear walking in the grass and leaves." (top) and "A dog jumping through the air above a pool of water that has been marked for distance, with people watching in the distance." (bottom). Each row displays an initial sample $X^{(0)}_{0}$ alongside two subsequent iterates $X^{(1)}_{0}, X^{(2)}_{0}$.
  • Figure 1: Comparison of average FID, FD$_\text{DINOv2}$, Precision/Recall, and Density/Coverage on ImageNet-$512$ for EDM2-S and EDM2-XXL.
  • Figure 2: Left: DDIM sampling with the CFG denoiser; Right: DDIM sampling with the ideal denoiser given by the tilted score. The trajectories of $1000$ particles are shown as thin red lines, and $5$ selected trajectories are shown as thick black lines along which the scores are depicted with arrows. The histogram of the simulated particles is shown in light gray. We also plot the ideal tilted score (black arrow) together with the contributions of the CFG score (red arrow) and the repulsive term arising from the Rényi divergence (blue arrow).
  • Figure 2: Comparison of FAD, KL, and IS on AudioCaps test set for AudioLDM 2-Full-Large model.
  • Figure 3: Impact of the hyperparameters for the EDM2-S (top) and EDM2-XXL (bottom) models. The metrics are computed with 10k generated samples.
  • ...and 20 more figures

Theorems & Definitions (22)

  • Example 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition: Restatement of the tilted-score proposition
  • Proof: Proof of the tilted-score proposition
  • Proposition: Restatement of the delayed-guidance proposition
  • Proof: Proof of the delayed-guidance proposition
  • Corollary 1
  • ...and 12 more