REG: Rectified Gradient Guidance for Conditional Diffusion Models

Zhengqi Gao, Kaiwen Zha, Tianyuan Zhang, Zihui Xue, Duane S. Boning

TL;DR

This work resolves a long-standing gap between guidance practice and theory in conditional diffusion by reframing guidance as joint distribution scaling of the denoising chain, introducing a formally valid objective $\bar{p}_\theta(\boldsymbol{x}_{0:T}|\boldsymbol{y}) \propto p_\theta(\boldsymbol{x}_{0:T}|\boldsymbol{y}) R_0(\boldsymbol{x}_0,\boldsymbol{y})$. It proves marginal-scaling interpretations are invalid and derives the unique joint-scaled transitions governed by $E_t(\boldsymbol{x}_t,\boldsymbol{y})$, with an updated noise predictor $\bar{\boldsymbol{\epsilon}}^{*}_{\theta,t}$. Building on this theory, the paper proposes Rectified Gradient Guidance (REG), a practical gradient-correction that approximates the optimal joint-scaling solution without foresight into the future. Extensive experiments on 1D/2D synthetic tasks, class-conditional ImageNet, and text-to-image generation show REG consistently improves FID and IS/CLIP scores across diverse guidance methods, with modest runtime/memory overhead. The results offer a principled, scalable pathway to enhance conditional diffusion models in real-world applications while clarifying foundational misunderstandings about guidance.

Abstract

Guidance techniques are simple yet effective for improving conditional generation in diffusion models. Despite their empirical success, the practical implementation of guidance diverges significantly from its theoretical motivation. In this paper, we reconcile this discrepancy by replacing the scaled marginal distribution target, which we prove theoretically invalid, with a valid scaled joint distribution objective. Additionally, we show that the established guidance implementations are approximations to the intractable optimal solution under a no-future-foresight constraint. Building on these theoretical insights, we propose rectified gradient guidance (REG), a versatile enhancement designed to boost the performance of existing guidance methods. Experiments on 1D and 2D synthetic tasks demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared with the same methods without it.
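
For orientation, the kind of "established guidance implementation" the abstract refers to can be sketched as the standard classifier-free guidance (CFG) combination of two noise predictions. This is a minimal sketch of vanilla CFG only, not of REG: the correction term that REG adds is not reproduced here because its equation does not appear in this summary. The function name `cfg_combine`, the weight convention (where `w = 0` returns the conditional prediction), and the toy numbers are assumptions for illustration.

```python
from typing import List, Sequence


def cfg_combine(eps_cond: Sequence[float],
                eps_uncond: Sequence[float],
                w: float) -> List[float]:
    """Vanilla classifier-free guidance on noise predictions.

    Assumed convention: eps = eps_cond + w * (eps_cond - eps_uncond),
    so w = 0 gives the plain conditional prediction. Other works index
    the guidance weight differently.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # Push the conditional prediction further away from the unconditional one.
    return [c + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy stand-ins for network outputs (made-up numbers, not from the paper).
eps_c = [0.5, -1.0, 2.0]
eps_u = [0.0, -0.5, 1.0]
guided = cfg_combine(eps_c, eps_u, w=1.0)
```

In a real sampler the two inputs would be the outputs of the conditional and unconditional noise prediction networks at a given time step, and the combined prediction would replace the plain conditional one in the denoising update.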

Paper Structure

This paper contains 28 sections, 3 theorems, 41 equations, 10 figures, 5 tables.

Key Result

Theorem 4.1

To satisfy the scaled goal given in Eq. (eq:scaled_goal_joint), we must have a unique set of transition kernels: where $t=0,1,\cdots,T$ and $\mathbf{x}_T=\varnothing$, which also determines: It implies the noise prediction network should be:

Figures (10)

  • Figure 1: Left: Guidance values are plotted along the X-axis in the range $[-1.0,2.0]$ at time step $t=13$. Right: Heatmaps depict the absolute differences between each gradient guidance value and the optimal guidance $\nabla \log E_t$, plotted on uniform grids in $[-1.0,2.0]$ at each time step. These two figures justify that our proposed REG aligns better with the optimal guidance $\nabla \log E_t$ compared to the vanilla CFG, i.e., $\nabla \log R_t$ without REG.
  • Figure 2: Results of guidance on a synthetic 2D two-class conditional generation task using a simple diffusion model with 25 time steps. (a)-(c) illustrate the target shape to be learned, the shape generated using CFG with our proposed REG, and the shape generated without REG, respectively. (d)-(f) depict $\nabla \log E_t$ and $\nabla \log R_t$ at $t=9$, which are gradients of a scalar with respect to a 2D vector, visualized as arrows in a 2D plane. (g)-(h) show the magnitude of the REG correction term (i.e., the second line of Eq. (eq:epsilon_bar_ours)) at $t=9$.
  • Figure 3: The Pareto front of FID versus IS is presented by varying the guidance weight $w$ over a broad range for different methods. The turning points are also included. Curves positioned further toward the bottom-right indicate superior performance. See Appendix (sec:appendix_additional_quantitative_results) for extra Pareto front results.
  • Figure 4: The Pareto front of FID versus CLIP score is shown by varying the guidance weight $w$ across a broad range for different methods. Curves closer to the bottom-right are better.
  • Figure 5: Guidance values are plotted along the X-axis in the range $[-1.0,2.0]$ at different time steps.
  • ...and 5 more figures
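
The Pareto fronts in Figures 3 and 4 trade off FID (lower is better) against IS or CLIP score (higher is better) as the guidance weight $w$ varies. As a rough sketch of how such a front could be extracted from a sweep, the helper below keeps only the non-dominated (FID, score) pairs. The function name `pareto_front` and the sample values are hypothetical and are not results from the paper.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (fid, score) pairs that no other pair dominates.

    A pair dominates another if its FID is no higher and its score is no
    lower, with at least one of the two strictly better.
    """
    front = []
    for i, (fid_i, score_i) in enumerate(points):
        dominated = any(
            fid_j <= fid_i and score_j >= score_i
            and (fid_j < fid_i or score_j > score_i)
            for j, (fid_j, score_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((fid_i, score_i))
    # Sort by score so the front can be drawn left to right.
    return sorted(front, key=lambda p: p[1])


# Hypothetical (FID, IS) pairs from a sweep over w; illustrative only.
sweep = [(3.0, 200.0), (2.5, 250.0), (4.0, 300.0), (5.0, 280.0)]
front = pareto_front(sweep)
```

With FID on the vertical axis and the score on the horizontal axis, one method is better than another when its front lies further toward the bottom-right, which is the reading the captions describe.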

Theorems & Definitions (3)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3