TCFG: Tangential Damping Classifier-free Guidance
Mingi Kwon, Shin seong Kim, Jaeseok Jeong, Yi Ting Hsiao, Youngjung Uh
TL;DR
This work addresses the misalignment between unconditional and conditional scores in classifier-free guidance (CFG) for diffusion-based text-to-image synthesis. It introduces Tangential Damping Classifier-free Guidance (TCFG), which applies singular value decomposition (SVD) to the matrix formed by the score pair, removes the tangential, misaligned components, and projects the unconditional score onto the dominant normal direction of the conditional manifold, yielding a refined guidance signal. The approach achieves consistent FID improvements across multiple text-to-image diffusion models (SD v1.5, SDXL, SD v3) and with DiT on ImageNet, with negligible computational overhead and stable CLIP scores. By revealing and exploiting the manifold/tangent-space structure of diffusion scores, TCFG enhances conditional image quality, reduces overexposure biases, and remains compatible with existing CFG improvements and high-resolution generation scenarios.
Abstract
Diffusion models have achieved remarkable success in text-to-image synthesis, largely attributed to the use of classifier-free guidance (CFG), which enables high-quality, condition-aligned image generation. CFG combines the conditional score (e.g., text-conditioned) with the unconditional score to control the output. However, the unconditional score is in charge of estimating the transition between manifolds of adjacent timesteps from $x_t$ to $x_{t-1}$, which may inadvertently interfere with the trajectory toward the specific condition. In this work, we introduce a novel approach that leverages a geometric perspective on the unconditional score to enhance CFG performance when conditional scores are available. Specifically, we propose a method that filters the singular vectors of both conditional and unconditional scores using singular value decomposition. This filtering process aligns the unconditional score with the conditional score, thereby refining the sampling trajectory to stay closer to the manifold. Our approach improves image quality with negligible additional computation. We provide deeper insights into the score function behavior in diffusion models and present a practical technique for achieving more accurate and contextually coherent image synthesis.
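To make the idea in the abstract concrete, the following is a minimal NumPy sketch of one plausible reading of the method, not the authors' implementation. It assumes the two score (noise) predictions are flattened and stacked into a 2 x D matrix, that the leading right singular vector of this matrix stands in for the dominant direction, and that the unconditional score is replaced by its projection onto that direction before the usual CFG combination. The function names `cfg`, `damp_unconditional`, and `tcfg`, the `rank` parameter, and the guidance scale `w` are illustrative assumptions; the paper's exact procedure may differ.

```python
import numpy as np


def cfg(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the conditional one with scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)


def damp_unconditional(eps_cond, eps_uncond, rank=1):
    """Project the unconditional prediction onto the leading right
    singular vector(s) of the stacked score pair.

    Assumption (not confirmed by the source): the pair is stacked as a
    (2, D) matrix and only the top `rank` singular directions are kept.
    """
    shape = eps_uncond.shape
    # Rows: conditional and unconditional predictions, flattened to length D.
    pair = np.stack([eps_cond.ravel(), eps_uncond.ravel()])
    # Economy SVD: vt has shape (2, D); rows are orthonormal directions
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(pair, full_matrices=False)
    basis = vt[:rank]
    # Orthogonal projection of the unconditional row onto span(basis).
    projected = (basis @ pair[1]) @ basis
    return projected.reshape(shape)


def tcfg(eps_cond, eps_uncond, w, rank=1):
    """CFG computed with the damped unconditional prediction."""
    return cfg(eps_cond, damp_unconditional(eps_cond, eps_uncond, rank), w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical latent-shaped predictions, for illustration only.
    eps_c = rng.standard_normal((4, 8, 8))
    eps_u = rng.standard_normal((4, 8, 8))
    guided = tcfg(eps_c, eps_u, w=7.5)
    print(guided.shape)
```

Under these assumptions the extra cost is a single SVD of a 2 x D matrix per sampling step, which is small next to a denoiser forward pass. With `rank=2` the projection leaves the unconditional prediction unchanged, so the sketch reduces to standard CFG.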
