REG: Rectified Gradient Guidance for Conditional Diffusion Models
Zhengqi Gao, Kaiwen Zha, Tianyuan Zhang, Zihui Xue, Duane S. Boning
TL;DR
This work resolves a long-standing gap between guidance practice and theory in conditional diffusion by reframing guidance as joint distribution scaling of the denoising chain, introducing a formally valid objective $\bar{p}_\theta(\boldsymbol{x}_{0:T}|\boldsymbol{y}) \propto p_\theta(\boldsymbol{x}_{0:T}|\boldsymbol{y}) R_0(\boldsymbol{x}_0,\boldsymbol{y})$. It proves marginal-scaling interpretations are invalid and derives the unique joint-scaled transitions governed by $E_t(\boldsymbol{x}_t,\boldsymbol{y})$, with an updated noise predictor $\bar{\boldsymbol{\epsilon}}^{*}_{\theta,t}$. Building on this theory, the paper proposes Rectified Gradient Guidance (REG), a practical gradient-correction that approximates the optimal joint-scaling solution without foresight into the future. Extensive experiments on 1D/2D synthetic tasks, class-conditional ImageNet, and text-to-image generation show REG consistently improves FID and IS/CLIP scores across diverse guidance methods, with modest runtime/memory overhead. The results offer a principled, scalable pathway to enhance conditional diffusion models in real-world applications while clarifying foundational misunderstandings about guidance.
Abstract
Guidance techniques are simple yet effective for improving conditional generation in diffusion models. Despite their empirical success, the practical implementation of guidance diverges significantly from its theoretical motivation. In this paper, we reconcile this discrepancy by replacing the scaled marginal distribution target, which we prove theoretically invalid, with a valid scaled joint distribution objective. Additionally, we show that the established guidance implementations are approximations to the intractable optimal solution under a no-future-foresight constraint. Building on these theoretical insights, we propose rectified gradient guidance (REG), a versatile enhancement designed to boost the performance of existing guidance methods. Experiments on 1D and 2D synthetic examples demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings, relative to the same guidance methods without it.
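
The scaled joint distribution objective can be checked by hand on a tiny discrete chain. The sketch below is illustrative only and is not the paper's implementation. It assumes a three-state, two-step reverse chain with made-up probabilities and a made-up reward `R0`, keeps the condition fixed (so it is dropped from the notation), and assumes that the quantity governing the scaled transitions is the conditional expectation of the reward, E_t(x_t) = E[R0(x_0) | x_t]. Under those assumptions, reweighting each transition by E_{t-1}(x_{t-1}) / E_t(x_t) yields valid transition probabilities whose x_0 marginal equals the original marginal times the reward, up to normalization.

```python
K, T = 3, 2  # number of states and denoising steps (toy values, hypothetical)

# Reverse chain p(x_T) and p(x_{t-1} | x_t); the condition y is held fixed.
p_T = [0.5, 0.3, 0.2]
# trans[t-1][i][j] = p(x_{t-1} = j | x_t = i)
trans = [
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]],  # t = 1
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.2, 0.2, 0.6]],  # t = 2
]
R0 = [1.0, 4.0, 0.25]  # reward on the final sample x_0 (hypothetical)

# E[t][i] = E[R0(x_0) | x_t = i], by backward recursion starting from E_0 = R0.
E = [R0]
for t in range(1, T + 1):
    E.append([sum(trans[t - 1][i][j] * E[t - 1][j] for j in range(K))
              for i in range(K)])

# Joint scaling: reweight the initial distribution and every transition.
Z = sum(p_T[i] * E[T][i] for i in range(K))  # normalizer, equals E[R0(x_0)]
bar_p_T = [p_T[i] * E[T][i] / Z for i in range(K)]
bar_trans = [
    [[trans[t - 1][i][j] * E[t - 1][j] / E[t][i] for j in range(K)]
     for i in range(K)]
    for t in range(1, T + 1)
]

def x0_marginal(init, steps):
    """Push the distribution of x_T through the reverse steps down to x_0."""
    dist = init
    for t in range(T, 0, -1):
        dist = [sum(dist[i] * steps[t - 1][i][j] for i in range(K))
                for j in range(K)]
    return dist

p0 = x0_marginal(p_T, trans)              # original marginal of x_0
bar_p0 = x0_marginal(bar_p_T, bar_trans)  # marginal under the scaled chain
target = [p0[j] * R0[j] / Z for j in range(K)]  # p(x_0) * R0(x_0) / Z

print("scaled-chain x_0 marginal:", bar_p0)
print("p(x_0) * R0(x_0) / Z:     ", target)
```

Computing `E` requires summing over every path from x_t down to x_0, which is cheap here but is the kind of foresight into the future that is unavailable at sampling time in a real diffusion model. That is the sense in which the exact solution is intractable and practical guidance methods can only approximate it.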
