Navigating with Annealing Guidance Scale in Diffusion Space

Shai Yehezkel, Omer Dahary, Andrey Voynov, Daniel Cohen-Or

TL;DR

This work proposes an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal, and significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation.

Abstract

Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.

Paper Structure

This paper contains 31 sections, 14 equations, 24 figures, 7 tables, 4 algorithms.

Figures (24)

  • Figure 1: Our annealing guidance scheduler significantly enhances image quality and alignment with the text prompt.
  • Figure 2: Guidance Scale Over Time. Top: Guidance scale trajectories for two prompts: A and B. CFG++ uses a constant scale for both prompts, while our annealing scheduler dynamically adapts the scale per prompt. CFG is omitted from the plot for clarity but uses a fixed scale of $w = 10$. Bottom: Comparison of generations from CFG (left), CFG++ (center) and our method (right). Our scheduler improves both quality and alignment: resolving visual artifacts (distorted hands, scene A) and correcting object counts (scene B).
  • Figure 3: Classifier-Free Guidance step. The denoising step of a sample $z_{t}$ is illustrated as a linear combination of the conditional noise prediction $\epsilon_{t}^c$ and the unconditional noise prediction $\epsilon_{t}^\varnothing$. We denote the difference between predictions as $\delta_t = \epsilon_t^c - \epsilon_t^\varnothing$. The dashed line represents possible $z_{t-1}$ predictions using CFG. For simplicity, we do not depict the rescaling of $z_t$ which is performed at each denoising step. Here, $z_{t-1}^{(1)}$ and $z_{t-1}^{(2)}$ denote predictions corresponding to two different guidance scales, $w_1$ and $w_2$, respectively. The blue manifold represents the density $p_t(z)$, while the orange manifold illustrates the conditional density $p_t(z | c)$.
  • Figure 4: Geometric intuition of $\delta_t$. A 2D illustration showing how the magnitude of $\delta_t = \epsilon_t^c - \epsilon_t^{\varnothing}$ reflects alignment with the prompt. At time $t$, the sample $z_{0|t}$ lies near mode A, which partially aligns with the prompt, resulting in a small $\|\delta_t\|$. As the denoising progresses, following the direction of $\delta_t$ leads toward mode B that better aligns with the prompt. The candidate points along the line correspond to different guidance scales $w$. Among these, $z^{(2)}_{0|t-1}$ lies closest to mode B, where the conditional and unconditional predictions are best aligned, yielding a minimal $\|\delta_{t-1}\|$.
  • Figure 5: Heatmaps showing the predicted guidance scale $w_\theta$ as a function of timestep $t$ and $\|\delta_t\|$, for three values of $\lambda$. The color represents the value of $w_\theta \left( t, \|\delta_t\|, \lambda \right)$, with the colormap shown on the right. Larger $t$ corresponds to earlier diffusion steps, with $t = 0$ marking the end of denoising. At each step, $\|\delta_t\|$ is recomputed and used to dynamically predict the guidance scale, forming a trajectory over time as demonstrated in Fig. \ref{fig:sds_scale_over_time}.
  • ...and 19 more figures
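
The captions of Figures 3 and 5 describe a guidance step in which the difference $\delta_t = \epsilon_t^c - \epsilon_t^\varnothing$ is scaled by a guidance scale that may depend on $t$, $\|\delta_t\|$ and $\lambda$. The following is a minimal sketch of that step using the standard CFG combination $\epsilon_t^\varnothing + w\,\delta_t$. It is not the authors' implementation: the function names and the `toy_adaptive_scale` rule are hypothetical placeholders, and the learned scheduler $w_\theta$ of the paper would take the place of `scale_fn`.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
# Assumed signature for a scale rule: (t, ||delta_t||, lambda) -> w
ScaleFn = Callable[[int, float, float], float]


def guidance_delta(eps_cond: Vector, eps_uncond: Vector) -> Vector:
    """delta_t = eps_t^c - eps_t^null, computed elementwise."""
    return [c - u for c, u in zip(eps_cond, eps_uncond)]


def l2_norm(v: Vector) -> float:
    """Euclidean norm, used here for ||delta_t||."""
    return math.sqrt(sum(x * x for x in v))


def guided_noise(
    eps_cond: Vector,
    eps_uncond: Vector,
    t: int,
    lam: float,
    scale_fn: ScaleFn,
) -> Tuple[Vector, float]:
    """Standard CFG combination: eps_uncond + w * delta_t.

    The scale w is recomputed at every step from (t, ||delta_t||, lam).
    """
    delta = guidance_delta(eps_cond, eps_uncond)
    w = scale_fn(t, l2_norm(delta), lam)
    eps_hat = [u + w * d for u, d in zip(eps_uncond, delta)]
    return eps_hat, w


def constant_scale(t: int, delta_norm: float, lam: float) -> float:
    """Baseline: a fixed scale that ignores all of its inputs."""
    return 7.5


def toy_adaptive_scale(t: int, delta_norm: float, lam: float) -> float:
    """Hypothetical placeholder, NOT the learned w_theta from the paper.

    It only illustrates a scale that varies with ||delta_t|| and lambda.
    """
    return 1.0 + lam * delta_norm


if __name__ == "__main__":
    eps_c, eps_u = [3.0, 4.0], [0.0, 0.0]
    for fn in (constant_scale, toy_adaptive_scale):
        eps_hat, w = guided_noise(eps_c, eps_u, t=10, lam=0.5, scale_fn=fn)
        print(fn.__name__, w, eps_hat)
```

Passing the scale rule as a callable keeps the combination step identical for the constant and the adaptive case, which mirrors the claim that the scheduler can replace a fixed scale without changing the rest of the sampler.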