Mitigating Diffusion Model Hallucinations with Dynamic Guidance

Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras

TL;DR

Dynamic Guidance tackles diffusion-model hallucinations by adaptively sharpening the score function $s_\theta(x_t,t)$ along directions that induce artifacts during sampling, while preserving benign semantic interpolations. At each timestep, it identifies the most likely class $y^*=\text{argmax}_y\log p(y|x_t)$ and applies guided denoising with $\hat{\epsilon}=\epsilon_\theta(x_t,t) - \lambda\sqrt{1-\bar{\alpha}_t}\nabla_{x_t}\log p(y^*|x_t)$, enabling generation-time control without fixed early-conditioning. The method is evaluated from toy 2D Gaussians to real-world ImageNet-scale generation, showing substantial hallucination reductions (often >50%) across settings and improved proxy metrics (precision, Inception Score) over traditional guidance and post-hoc filtering. This work provides a principled, efficient approach to reduce hallucinations in diffusion sampling, with practical impact for more reliable and diverse image generation.
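
The per-timestep update described above can be sketched on a toy problem. This is a minimal NumPy illustration, not the authors' implementation: it assumes a hypothetical analytic classifier (equal-prior isotropic Gaussian classes with means `means` and variance `var`, taken to describe the noised classes at the current timestep) in place of a trained noise-aware classifier, and all function names are illustrative. Since the score and the noise prediction are related by $s_\theta = -\epsilon_\theta/\sqrt{1-\bar{\alpha}_t}$, subtracting the scaled classifier gradient from $\epsilon_\theta$ is equivalent to adding $\lambda\nabla_{x_t}\log p(y^*|x_t)$ to the score.

```python
import numpy as np


def class_log_probs(x, means, var):
    """log p(y | x) for a toy classifier (assumption: equal-prior isotropic Gaussians)."""
    sq_dist = np.sum((x[None, :] - means) ** 2, axis=1)  # shape (num_classes,)
    logits = -0.5 * sq_dist / var
    return logits - np.logaddexp.reduce(logits)  # normalized log-softmax


def grad_log_prob(x, y, means, var):
    """Analytic gradient of log p(y | x) with respect to x for the toy classifier."""
    probs = np.exp(class_log_probs(x, means, var))
    # grad of logit_k is (mu_k - x) / var, so the log-softmax gradient reduces to
    # (mu_y - sum_k p_k mu_k) / var.
    return (means[y] - probs @ means) / var


def dynamic_guidance_eps(eps, x_t, alpha_bar_t, means, var, lam):
    """Pick y* = argmax_y log p(y | x_t), then guide the noise prediction toward y*."""
    y_star = int(np.argmax(class_log_probs(x_t, means, var)))
    grad = grad_log_prob(x_t, y_star, means, var)
    eps_hat = eps - lam * np.sqrt(1.0 - alpha_bar_t) * grad
    return eps_hat, y_star


# Toy usage: two modes on the x-axis and a sample sitting between them.
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
x_t = np.array([0.5, 0.0])
eps = np.zeros(2)  # stand-in for a flat (zero) unguided prediction between modes
eps_hat, y_star = dynamic_guidance_eps(eps, x_t, alpha_bar_t=0.5, means=means, var=1.0, lam=1.0)
```

In this toy setting the selected class is the nearer mode, and the guided prediction has a negative first component, which corresponds to a score pointing toward that mode. Setting `lam=0` recovers the unguided prediction.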

Abstract

Diffusion models, despite their impressive demos, often produce hallucinatory samples with structural inconsistencies that lie outside of the support of the true data distribution. Such hallucinations can be attributed to excessive smoothing between modes of the data distribution. However, semantic interpolations are often desirable and can lead to generation diversity, so we believe a more nuanced solution is required. In this work, we introduce Dynamic Guidance, which tackles this issue. Dynamic Guidance mitigates hallucinations by selectively sharpening the score function only along the pre-determined directions known to cause artifacts, while preserving valid semantic variations. To our knowledge, this is the first approach that addresses hallucinations at generation time rather than through post-hoc filtering. Dynamic Guidance substantially reduces hallucinations on both controlled and natural image datasets, significantly outperforming baselines.

Paper Structure

This paper contains 32 sections, 10 equations, 14 figures, 6 tables, 2 algorithms.

Figures (14)

  • Figure 1: (a) Examples of training images, hallucinations, and corresponding samples fixed by Dynamic Guidance; (b) We pick an initial image that contains two different shapes (triangle + square), a hallucination for the Single Shapes dataset. We focus on a latent dimension that controls the appearance of the left shape (Top). To resolve the hallucination, the square on the left should disappear or turn into a triangle. In the in-between region, where the left shape is square or pentagon, the unguided score function is zero, "trapping" the sample and generating a hallucination. Dynamic Guidance sharpens the score in this region, steering the sample toward valid images that only contain triangles. Dynamic Guidance does not affect the score function along dimensions that are unrelated to hallucinations, like the one controlling the position of the shape on the right (Bottom).
  • Figure 2: Examples of valid samples and hallucinations for the Single Shapes (Top) and Mixed Shapes (Bottom) datasets.
  • Figure 3: Score Function Sharpening. The learned score function of the diffusion model with and without Dynamic Guidance, compared to the true score function for a 2D mixture of Gaussians across the x dimension. The model learns a smoothed-out score function, which Dynamic Guidance sharpens so that it more closely approximates the correct one.
  • Figure 4: Images generated with Classifier and Dynamic Guidance using the same initial noises. Initial condition for Classifier Guidance is either set to a specific class ("llama") (Top) or randomly selected (Bottom). For Dynamic Guidance we show the final predicted label. We observe that when the condition is fixed and misaligned with the initial noise the diffusion model can generate low-quality samples that visually resemble hallucinations.
  • Figure A.5: Distribution of final predicted ImageNet classes for samples generated with Classifier and Dynamic guidance using $\lambda=1$.
  • ...and 9 more figures