Dynamic VLM-Guided Negative Prompting for Diffusion Models

Hoyeon Chang, Seungjin Kim, Yoonseok Choi

TL;DR

This work tackles the limitations of fixed negative prompts in diffusion models by introducing dynamic negative prompting via Vision-Language Models (VL-DNP). The approach queries a VLM at selected denoising steps to generate adaptive negative prompts $c^-_{t_i}$ based on predicted $\hat{x}_0^{(i)}$, integrating with pretrained CFG-based diffusion without retraining. Empirically, VL-DNP improves safety (lower Attack Success Rate and Toxic Rate) while maintaining text-image alignment (CLIP) and image fidelity (FID) across COCO prompts and safety benchmarks, outperforming static prompting and SAFREE on Pareto fronts. The method offers a practical, real-time pathway to content-aware filtering in diffusion models, with potential efficiency gains from lightweight VLMs or caching in future work.

Abstract

We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional Negative Prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
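The loop described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the guidance formula in `guided_eps` is one common way to combine positive and negative prompts in classifier-free guidance, the update is a deterministic DDIM-style step, and `toy_denoiser` and `toy_vlm` are hypothetical stand-ins for the pretrained diffusion model and the VLM. The clean-image estimate in `predict_x0` is the standard DDPM identity.

```python
import numpy as np

def predict_x0(x_t, eps, alpha_bar_t):
    """Standard DDPM estimate of the clean image from a noisy sample."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def guided_eps(eps_uncond, eps_pos, eps_neg, w_pos, w_neg):
    """CFG with an optional negative branch (assumed form, not taken from the paper)."""
    out = eps_uncond + w_pos * (eps_pos - eps_uncond)
    if eps_neg is not None:
        out = out - w_neg * (eps_neg - eps_uncond)
    return out

def sample(denoiser, query_vlm, prompt, alpha_bars, query_steps,
           w_pos=7.5, w_neg=7.5, shape=(4,), seed=0):
    """Denoise from t = T-1 down to 0, refreshing the negative prompt at query_steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    neg, log = None, []
    for t in range(len(alpha_bars) - 1, -1, -1):
        ab = alpha_bars[t]
        eps_u = denoiser(x, t, "")
        eps_p = denoiser(x, t, prompt)
        eps_n = denoiser(x, t, neg) if neg is not None else None
        eps = guided_eps(eps_u, eps_p, eps_n, w_pos, w_neg)
        x0_hat = predict_x0(x, eps, ab)
        if t in query_steps:
            # Ask the (hypothetical) VLM what to avoid, given the predicted image.
            new_neg = query_vlm(x0_hat)
            if new_neg:
                neg = new_neg  # applies to the remaining denoising steps
            log.append((t, neg))
        # Deterministic DDIM-style update toward the previous noise level.
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps
    return x, log

def toy_denoiser(x, t, cond):
    """Hypothetical stand-in for a pretrained noise predictor."""
    return 0.1 * x + 0.01 * len(cond)

def toy_vlm(x0_hat):
    """Hypothetical stand-in for a VLM; returns a negative prompt or None."""
    return "blood" if float(np.mean(x0_hat)) > 0 else None

alpha_bars = np.linspace(0.999, 0.05, 20)  # alpha_bar decreases as t grows
image, queries = sample(toy_denoiser, toy_vlm, "a city street", alpha_bars, {15, 5})
```

Querying at only a few timesteps keeps the number of VLM calls small; between queries the most recent negative prompt is reused.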

Paper Structure

This paper contains 17 sections, 6 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: VL-DNP inference pipeline. The positive text prompt is fed to a pretrained diffusion model. At a small set of timesteps $t_i\!\in\!\mathcal{T}$ we predict the clean image $\hat{x}_0$, query a lightweight vision–language model (VLM) and obtain a dynamic negative prompt. The prompt is fed back as classifier-free guidance, steering the remaining denoising steps away from any unsafe content detected in the intermediate image.
  • Figure 2: Safety–alignment Pareto plots. Circles = dynamic VLM-guided prompts; squares = static prompts; triangles = SAFREE baseline. Axes follow "larger = better": Safety $(1{-}\text{ASR})$ on $x$, CLIP on $y$. Markers are labelled by guidance scale $\omega$.
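
For readers unfamiliar with Pareto plots such as Figure 2, the sketch below shows how a Pareto front can be extracted from (safety, CLIP) pairs when larger is better on both axes. The helper name `pareto_front` and the numbers in the usage line are illustrative placeholders, not results from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (larger is better on both axes)."""
    front = []
    for i, (s_i, c_i) in enumerate(points):
        dominated = any(
            s_j >= s_i and c_j >= c_i and (s_j > s_i or c_j > c_i)
            for j, (s_j, c_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((s_i, c_i))
    return sorted(front)

# Placeholder (safety, CLIP) pairs, one per guidance scale; not measured values.
front = pareto_front([(0.5, 0.3), (0.6, 0.2), (0.4, 0.25)])
```

A method sits on a better front when, at a given safety level, its points reach a higher CLIP score than those of the competing method.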