Classifier-free Guidance with Adaptive Scaling

Dawid Malarz, Artur Kasymov, Maciej Zięba, Jacek Tabor, Przemysław Spurek

TL;DR

Classifier-free guidance (CFG) often faces a trade-off between prompt adherence and image quality. The authors introduce β-CFG, which combines gradient normalization with a time-dependent $β$-distribution to adapt guidance strength across diffusion steps, stabilizing the sampling trajectory and reducing outliers. Empirically, β-CFG achieves better FID scores while preserving text-to-image CLIP similarity comparable to standard CFG on tasks like COCO-based generation with SD models, and demonstrations on toy 2D data show closer alignment to the data manifold. The approach provides a practical, parameterizable framework for adaptive guidance that improves data-manifold alignment and sample quality without external classifiers.

Abstract

Classifier-free guidance (CFG) is an essential mechanism in contemporary text-driven diffusion models. In practice, the guidance strength controls a trade-off between the quality of the generated images and their correspondence to the prompt. With strong guidance, generated images closely fit the conditioning text, but at the cost of quality. Conversely, weak guidance yields high-quality results, but the generated images may not match the prompt. In this paper, we present $β$-CFG ($β$-adaptive scaling in Classifier-Free Guidance), which controls the impact of guidance during generation to address this trade-off. First, $β$-CFG stabilizes the effects of guidance through gradient-based adaptive normalization. Second, $β$-CFG uses a family of unimodal, time-dependent curves ($β$-distribution) to dynamically adapt the trade-off between prompt matching and sample quality during the diffusion denoising process. Our model obtains better FID scores while maintaining text-to-image CLIP similarity scores at a level similar to that of the reference CFG.
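The two ingredients named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`beta_pdf`, `guided_prediction`), the default shape parameters `a = b = 2`, the `scale` factor, and the choice of dividing the guidance direction by its L2 norm are all assumptions made for illustration, and the paper's exact normalization and parameterization may differ. The sketch only shows the general idea of a Beta-shaped weight over normalized time that vanishes at both ends of the trajectory, applied to a normalized guidance direction.

```python
import math


def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t in [0, 1].

    Assumes a > 1 and b > 1, so the curve is unimodal and equals zero
    at both endpoints (no guidance at the start or end of sampling).
    """
    if t <= 0.0 or t >= 1.0:
        return 0.0
    # Work in log space for numerical stability.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


def guided_prediction(eps_uncond, eps_cond, t, scale=1.0, a=2.0, b=2.0, eps=1e-8):
    """Hypothetical adaptive guidance step on plain Python lists.

    eps_uncond / eps_cond: unconditional and conditional model outputs.
    t: normalized position along the sampling trajectory, in [0, 1].
    """
    # Guidance direction: conditional minus unconditional prediction.
    diff = [c - u for c, u in zip(eps_cond, eps_uncond)]
    # Assumed normalization: divide the direction by its L2 norm.
    norm = math.sqrt(sum(d * d for d in diff))
    # Time-dependent weight from the Beta-shaped curve.
    weight = scale * beta_pdf(t, a, b)
    return [u + weight * d / (norm + eps) for u, d in zip(eps_uncond, diff)]
```

With these assumed defaults, the weight peaks in the middle of the trajectory, and at `t = 0` or `t = 1` the function returns the unconditional prediction unchanged.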

Paper Structure

This paper contains 20 sections, 18 equations, 12 figures, 3 tables, 4 algorithms.

Figures (12)

  • Figure 1: A two-dimensional distribution featuring two classes represented by gray and orange regions. (a) Ground truth samples from the orange class. (b) Conditional sampling with no additional guidance techniques. (c) Classifier-free guidance decreases sample diversity to achieve outlier removal. (d) $\beta$-CFG preserves the diversity of the samples while still achieving the objective of outlier removal.
  • Figure 2: Norm values of the modification factor applied at each iteration of the backward sampling process in classifier-free guided diffusion. We compare classical CFG and our solution, $\beta$-CFG. We model this trajectory with a $\beta$-distribution and a parameter $\gamma$. The $\beta$-distribution gives the general trend of the diffusion process. For $\gamma=1$ we have a pure Gamma curve, while as $\gamma$ goes to zero, local perturbations from pure CFG are added. Thanks to the $\beta$-distribution, there is no guidance at the beginning and at the end of the trajectory.
  • Figure 3: Comparison of CFG and $\beta$-CFG. As we can see, our model produces more realistic images, which is consistent with the numerical results from Tab. \ref{tab:all}.
  • Figure 4: Ablation study of our models on data generated for the prompt: "beautiful lady, freckles, big smile, blue eyes, short ginger hair, wearing a floral blue vest top, soft light, dark gray background." Thanks to the $\beta$-distribution, we can model how the diffusion trajectory behaves near data manifolds.
  • Figure 5: The evolution of denoised estimates differs between CFG and $\beta$-CFG. Both methods behave similarly at the beginning of the trajectory. However, $\beta$-CFG converges faster to the data manifold, producing an image that is more consistent with the prompt: "a shoe rack with some shoes and a dog sleeping on them".
  • ...and 7 more figures