CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models

Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, Jong Chul Ye

TL;DR

CFG++ addresses the off-manifold drawbacks of classifier-free guidance by reframing text guidance as a diffusion-based inverse problem and applying a manifold-constrained, score-matching approach. The method replaces the traditional sharpened conditional denoising with an interpolation that uses unconditional denoising, yielding a simple yet effective update that remains on the data manifold. Empirically, CFG++ improves text-to-image quality, enables invertible DDIM inversion, and enhances performance on text-conditioned inverse problems, across both standard and distilled diffusion models. This approach offers more stable guidance at lower scales and broad applicability to diffusion-based editing and problem-solving tasks.

Abstract

Classifier-free guidance (CFG) is a fundamental tool in modern diffusion models for text-guided generation. Although effective, CFG has notable drawbacks. For instance, DDIM with CFG lacks invertibility, complicating image editing; furthermore, high guidance scales, essential for high-quality outputs, frequently result in issues like mode collapse. Contrary to the widespread belief that these are inherent limitations of diffusion models, this paper reveals that the problems actually stem from the off-manifold phenomenon associated with CFG, rather than the diffusion models themselves. More specifically, inspired by the recent advancements of diffusion model-based inverse problem solvers (DIS), we reformulate text-guidance as an inverse problem with a text-conditioned score matching loss and develop CFG++, a novel approach that tackles the off-manifold challenges inherent in traditional CFG. CFG++ features a surprisingly simple fix to CFG, yet it offers significant improvements, including better sample quality for text-to-image generation, invertibility, smaller guidance scales, reduced mode collapse, etc. Furthermore, CFG++ enables seamless interpolation between unconditional and conditional sampling at lower guidance scales, consistently outperforming traditional CFG at all scales. Moreover, CFG++ can be easily integrated into high-order diffusion solvers and naturally extends to distilled diffusion models. Experimental results confirm that our method significantly enhances performance in text-to-image generation, DDIM inversion, editing, and solving inverse problems, suggesting a wide-ranging impact and potential applications in various fields that utilize text guidance. Project Page: https://cfgpp-diffusion.github.io/.
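
The difference between the two guidance rules can be sketched on a single deterministic DDIM step. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard variance-preserving DDIM update, and it assumes that CFG++ differs from CFG only in (i) using a guidance scale in [0, 1] and (ii) renoising with the unconditional noise estimate instead of the guided one. The function name `guided_ddim_step`, the flag `use_cfgpp`, and the scalar toy inputs are hypothetical; in practice `eps_uncond` and `eps_cond` would come from a diffusion model and the latents would be tensors.

```python
import math


def guided_ddim_step(x_t, eps_uncond, eps_cond, abar_t, abar_prev, scale, use_cfgpp):
    """One deterministic DDIM step with CFG or (assumed) CFG++ guidance."""
    # Guided noise estimate. CFG typically uses scale > 1 (extrapolation);
    # the CFG++ variant sketched here uses a scale in [0, 1] (interpolation).
    eps_guided = eps_uncond + scale * (eps_cond - eps_uncond)
    # Denoised estimate obtained from the guided noise.
    x0_hat = (x_t - math.sqrt(1.0 - abar_t) * eps_guided) / math.sqrt(abar_t)
    # Renoising: CFG reuses the guided noise, while the CFG++ variant
    # uses the unconditional noise (assumption stated in the lead-in).
    eps_renoise = eps_uncond if use_cfgpp else eps_guided
    return math.sqrt(abar_prev) * x0_hat + math.sqrt(1.0 - abar_prev) * eps_renoise


# Toy scalar "latent" and made-up noise predictions, for illustration only.
x_t, eps_uncond, eps_cond = 0.8, 0.1, 0.5
abar_t, abar_prev = 0.5, 0.7  # cumulative alphas at steps t and t-1

x_cfg = guided_ddim_step(x_t, eps_uncond, eps_cond, abar_t, abar_prev,
                         scale=7.5, use_cfgpp=False)
x_cfgpp = guided_ddim_step(x_t, eps_uncond, eps_cond, abar_t, abar_prev,
                           scale=0.6, use_cfgpp=True)
print(x_cfg, x_cfgpp)
```

At the same scale, the two variants produce the same denoised estimate and differ only in the renoising term, by `sqrt(1 - abar_prev) * scale * (eps_cond - eps_uncond)`. Because the function uses only plain arithmetic on `x_t` and the noise estimates, it also works elementwise on NumPy arrays.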

Paper Structure

This paper contains 23 sections, 1 theorem, 32 equations, 21 figures, 4 tables, 4 algorithms.

Key Result

Proposition 1

Let $d{\boldsymbol z}({\boldsymbol x}_t) := {\boldsymbol z}({\boldsymbol x}_t) - {\boldsymbol z}({\boldsymbol x}_{t+1})$ denote the discrete-time evolution of some random variable ${\boldsymbol z}$ at time $t$. Then, the evolution of $\hat{{\boldsymbol x}}_{\boldsymbol c}^\omega$ of … where $\Delta({\boldsymbol x}_t, {\boldsymbol c}) := \hat{{\boldsymbol x}}_{\boldsymbol c}(\ldots$

Figures (21)

  • Figure 1: (Top) Comparison of T2I results by SDXL-Lightning for the prompt "kayak in the water, optical color, aerial view, rainbow". The CFG-guided image has significant artifacts, which are reduced in the CFG++ version. (Middle) DDIM inversion results under CFG show noticeable artifacts at various CFG scales, which are significantly reduced by CFG++. (Bottom) The evolution of denoised estimates differs between CFG and CFG++. CFG exhibits sudden shifts and intense color saturation early in reverse diffusion, while CFG++ transitions smoothly from low to high resolution.
  • Figure 2: Reverse Diffusion with CFG
  • Figure 3: The off-manifold phenomenon of CFG arises from: (a) the typical CFG scale $\omega > 1.0$, which leads to extrapolation and deviation from the piecewise linear data manifold, and (b) CFG's renoising process, which introduces a nonzero offset $\Delta^\omega$ from the correct manifold. CFG++ effectively mitigates all these artifacts.
  • Figure 4: Text-conditioned score matching loss throughout the reverse diffusion sampling for both CFG and CFG++ in SDXL. Average loss is computed over 55 prompts from PixArt-α (Chen et al., 2024).
  • Figure 5: T2I using SDXL-Lightning, 6 NFE, CFG vs CFG++.
  • ...and 16 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Proof