Guiding a Diffusion Model with a Bad Version of Itself

Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine

TL;DR

This work examines classifier-free guidance (CFG) in diffusion models and shows that CFG ties its image-quality gains to reduced variation. It proposes autoguidance, a simple method that guides a model with a weaker (smaller, less-trained) version of itself, improving image fidelity without sacrificing diversity. Empirically, autoguidance achieves state-of-the-art FID and FD$_\text{DINOv2}$ on ImageNet at 512x512 and 64x64, and it also improves unconditional generation; ablations highlight the importance of independent EMA settings and of the kind of degradation applied to the guiding model. The study includes synthetic degradation tests and qualitative analyses, and releases code and models, expanding the design space of guidance for diffusion-based synthesis.

Abstract

The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
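
The guidance idea described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released code: it assumes guidance is applied as a linear extrapolation between the two denoiser outputs with weight `w` (the form commonly used for CFG), and the names `guided_denoise` and `gaussian_denoiser` are hypothetical. The toy "models" are closed-form denoisers for zero-mean Gaussian data, where a larger `data_std` stands in for a weaker model that fits a more spread-out density.

```python
from typing import Callable, List, Sequence

# A denoiser maps (noisy sample, noise level) to a denoised estimate.
Denoiser = Callable[[Sequence[float], float], List[float]]


def guided_denoise(main: Denoiser, guide: Denoiser,
                   x: Sequence[float], sigma: float, w: float) -> List[float]:
    """Extrapolate from the guide's prediction toward and past the main model's.

    Assumed form: guide + w * (main - guide). With w = 1 the main model's
    output is returned unchanged; w > 1 pushes the result away from the guide.
    """
    d_main = main(x, sigma)
    d_guide = guide(x, sigma)
    return [g + w * (m - g) for m, g in zip(d_main, d_guide)]


def gaussian_denoiser(data_std: float) -> Denoiser:
    """Hypothetical stand-in: ideal denoiser for zero-mean Gaussian data."""
    def denoise(x: Sequence[float], sigma: float) -> List[float]:
        shrink = data_std**2 / (data_std**2 + sigma**2)
        return [shrink * xi for xi in x]
    return denoise


main_model = gaussian_denoiser(1.0)   # sharper fit to the data
guide_model = gaussian_denoiser(2.0)  # weaker model: more spread-out density
x_noisy = [3.0, -1.5]

unguided = guided_denoise(main_model, guide_model, x_noisy, sigma=1.0, w=1.0)
guided = guided_denoise(main_model, guide_model, x_noisy, sigma=1.0, w=2.0)
```

In this toy setup the guided estimate lies closer to the data mean than the unguided one. With CFG, `guide` would be an unconditional model; with autoguidance it would be a smaller, less-trained model that receives the same conditioning as `main`.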

Paper Structure

This paper contains 25 sections, 13 equations, 8 figures, 1 table, and 2 algorithms.

Figures (8)

  • Figure 1: A fractal-like 2D distribution with two classes indicated with gray and orange regions. Approximately 99% of the probability mass is inside the shown contours. (a) Ground truth samples drawn directly from the orange class distribution. (b) Conditional sampling using a small denoising diffusion model generates outliers. (c) Classifier-free guidance ($w=4$) eliminates outliers but reduces diversity by over-emphasizing the class. (d) Naive truncation via lengthening the score vectors. (e) Our method concentrates samples on high-probability regions without reducing diversity.
  • Figure 2: Closeup of the region highlighted in Figure 1. (a) The implied learned density $p_1(\mathbf{x} | \mathbf{c}; \sigma_\text{mid})$ (green) at an intermediate noise level $\sigma_\text{mid}$ and its score vectors (log-gradients), plotted at representative sample points. The learned density approximates the underlying ground truth $p(\mathbf{x} | \mathbf{c}; \sigma_\text{mid})$ (orange) but fails to replicate its sharper details. (b) The weaker unconditional model learns a further spread-out density $p_0(\mathbf{x}; \sigma_\text{mid})$ (red) with a looser fit to the data. (c) Guidance moves the points according to the gradient of the (log) ratio of the two learned densities (blue). As the higher-quality model is more sharply concentrated at the data, this field tends inward towards the data distribution. The corresponding gradient is simply the difference of respective gradients in (a) and (b), illustrated at selected points. (d) Sampling trajectories taken by standard unguided diffusion following the learned score $\nabla_\mathbf{x} \log p_1(\mathbf{x} | \mathbf{c}; \sigma)$, from noise level $\sigma_\text{mid}$ to $0$. The contours (orange) represent the ground truth noise-free density. (e) Guidance introduces an additional force shown in (c), causing the points to concentrate at the core of the data density during sampling.
  • Figure 3: Example results for the Tree frog, Palace, Mushroom, and Castle classes of ImageNet-512 using EDM2-S. Guidance weight increases to the right; rows are classifier-free guidance and our method.
  • Figure 4: Results for DeepFloyd IF using the prompt "A blue jay standing on a large basket of rainbow macarons". The rows correspond to guidance weights $w \in \{1, 2, 3, 4\}$. The leftmost column shows results for CFG and the rightmost for autoguidance (XL-sized model guided by M-sized one). The middle columns correspond to blending between the two. See the appendix for more examples.
  • Figure 5: Additional results for DeepFloyd IF. The rows correspond to guidance weights $w \in \{1, 2, 3, 4\}$. CFG and our method (XL-sized model guided by M-sized one) are shown in the leftmost and rightmost columns, respectively. The middle columns correspond to blending between the two.
  • ...and 3 more figures
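
The guidance field described in the Figure 2 caption, the difference between the score of the stronger model and the score of the weaker one, can be checked on a one-dimensional toy case. This is a rough sketch under stated assumptions: both learned densities are replaced by zero-mean Gaussians with hypothetical standard deviations, and the guided score is assumed to add the difference term scaled by `w - 1`. The function names are illustrative only.

```python
def gaussian_score(x: float, std: float) -> float:
    """Score (derivative of the log density) of a zero-mean 1D Gaussian."""
    return -x / std**2


def guidance_term(x: float, std_main: float, std_guide: float) -> float:
    """Gradient of log(p_main / p_guide): the difference of the two scores."""
    return gaussian_score(x, std_main) - gaussian_score(x, std_guide)


def guided_score(x: float, std_main: float, std_guide: float, w: float) -> float:
    """Assumed form: main score plus (w - 1) times the guidance term."""
    return gaussian_score(x, std_main) + (w - 1.0) * guidance_term(x, std_main, std_guide)


# Hypothetical values: the main model is sharper (std 1), the guide is wider (std 2).
term_right = guidance_term(2.0, std_main=1.0, std_guide=2.0)   # negative: pushes left
term_left = guidance_term(-2.0, std_main=1.0, std_guide=2.0)   # positive: pushes right
pulled = guided_score(2.0, std_main=1.0, std_guide=2.0, w=3.0)
```

Because the sharper density falls off faster away from the data, the difference term points toward the center from both sides, which matches the inward-pointing field in panel (c). Setting `w = 1` recovers the unguided score.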