Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance
Badr Moufad, Yazid Janati, Alain Durmus, Ahmed Ghorbel, Eric Moulines, Jimmy Olsson
TL;DR
This paper identifies a fundamental limitation of classifier-free guidance (CFG): the linear CFG denoiser does not correspond to a valid denoising diffusion model for the tilted target distribution $p_0(x|c)^w p_0(x)^{1-w}$. It identifies the gradient of a Rényi divergence as the missing repulsive term needed to restore consistency with a proper diffusion process, and proposes Classifier-Free Gibbs-Like Guidance (CFGiG), a Gibbs-like sampling procedure that starts from a sample of the conditional model and iteratively refines it through noising and CFG-based denoising, preserving diversity while improving quality. The authors provide theoretical analysis in a Gaussian setting, derive a two-noise-level tilted-score expression, and report substantial gains over CFG on both image and text-to-audio generation tasks. The work also suggests training-time objectives that explicitly account for the Rényi term to reduce tuning and enable guidance across the sampling range. Overall, CFGiG offers a principled way to improve perceptual quality while preserving sample diversity in conditional diffusion generation.
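For reference, the linear CFG combination and the tilted target usually associated with it can be written as follows. The notation is assumed here for illustration and may differ from the paper's: $D_\sigma$ denotes a denoiser at noise level $\sigma$, and $p_0$ the data distribution.

$$
D^{\mathrm{cfg}}_\sigma(x \mid c; w) = w\, D_\sigma(x \mid c) + (1 - w)\, D_\sigma(x)
$$

$$
w\, \nabla_x \log p_0(x \mid c) + (1 - w)\, \nabla_x \log p_0(x) = \nabla_x \log \big[ p_0(x \mid c)^{w}\, p_0(x)^{1-w} \big]
$$

The second line is the zero-noise limit of the combined score. By Bayes' rule, the bracketed density is proportional in $x$ to $p_0(c \mid x)^{w}\, p_0(x)$, so the same target can be read as the data distribution reweighted by a power of the classifier likelihood.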
Abstract
Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, we show that CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero, in which the data distribution is tilted by a power $w > 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at https://github.com/yazidjanati/cfgig
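The noise-then-denoise refinement described above can be sketched on a toy 1D Gaussian problem, where both denoisers are available in closed form. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian parameters, the noise level `sigma_star`, the number of rounds, and the deterministic Euler sampler are all hypothetical choices made for the example, and the initial samples are drawn directly from the conditional Gaussian as a stand-in for running the conditional diffusion model without CFG.

```python
import numpy as np

# Toy 1D setup (all values are illustrative assumptions, not from the paper).
MU_U, VAR_U = 0.0, 4.0   # unconditional data distribution N(0, 4)
MU_C, VAR_C = 1.0, 1.0   # conditional data distribution N(1, 1)


def gaussian_denoiser(x, sigma, mu, var):
    """Posterior mean E[x_0 | x_sigma = x] for x_0 ~ N(mu, var), x_sigma = x_0 + sigma * eps."""
    return (var * x + sigma**2 * mu) / (var + sigma**2)


def cfg_denoiser(x, sigma, w):
    """Linear CFG combination of the conditional and unconditional denoisers."""
    d_cond = gaussian_denoiser(x, sigma, MU_C, VAR_C)
    d_uncond = gaussian_denoiser(x, sigma, MU_U, VAR_U)
    return w * d_cond + (1.0 - w) * d_uncond


def denoise_with_cfg(x, sigmas, w):
    """Deterministic Euler steps from sigmas[0] down to sigmas[-1] == 0 (assumed sampler)."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = cfg_denoiser(x, s_cur, w)
        x = x + (s_next - s_cur) * (x - d) / s_cur
    return x


def gibbs_like_refine(x0, w, sigma_star, n_rounds, n_steps, rng):
    """Alternate between noising to level sigma_star and CFG-based denoising back to 0."""
    sigmas = np.linspace(sigma_star, 0.0, n_steps + 1)
    x = x0
    for _ in range(n_rounds):
        x_noisy = x + sigma_star * rng.standard_normal(x.shape)  # noising step
        x = denoise_with_cfg(x_noisy, sigmas, w)                 # CFG denoising step
    return x


rng = np.random.default_rng(0)
# Stand-in for samples from the conditional model without CFG.
x_init = MU_C + np.sqrt(VAR_C) * rng.standard_normal(10_000)
x_refined = gibbs_like_refine(x_init, w=2.0, sigma_star=1.0, n_rounds=5, n_steps=20, rng=rng)
print("initial mean/std:", x_init.mean(), x_init.std())
print("refined mean/std:", x_refined.mean(), x_refined.std())
```

Setting `w=1.0` makes `cfg_denoiser` coincide with the conditional denoiser, which gives a quick sanity check. Comparing the spread of `x_refined` across different values of `w` and `n_rounds` gives a rough feel for the quality-diversity trade-off in this toy setting.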
