Training-Free Safe Denoisers for Safe Use of Diffusion Models
Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park
TL;DR
The paper addresses safety in diffusion models with a training-free safe denoiser that modifies the sampling trajectory to steer away from unsafe regions of the data distribution, with no retraining or fine-tuning. It derives a relationship among the data denoiser, the safe denoiser, and the unsafe denoiser through a data-dependent weight $\beta^{*}(\mathbf{x}_t)$, and gives practical approximations that make the safe denoiser efficient to compute. The method can be combined with existing text-based safety mechanisms (CFG, SAFREE, SLD) and is reported to improve NSFW and IP-related safety metrics across text-to-image and conditional/unconditional generation, at modest computational overhead. It acts as a defense-in-depth layer at inference time that preserves sample quality and prompt alignment, and it is reported to be compatible with frontier models and IP-control tasks. Because it requires no training, it can be layered onto current safety pipelines to mitigate adversarial prompts and memorization risks in diffusion models.
Abstract
There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content or to generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by relying heavily on text-based negative prompts or by extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach: we directly modify the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints that need to be excluded) to avoid specific regions of the data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not safe, leading to our $\textit{safe}$ denoiser, which ensures that its final samples lie away from the area to be negated. Inspired by the derivation, we develop a practical algorithm that successfully produces high-quality samples while avoiding negation areas of the data distribution in text-conditional, class-conditional, and unconditional image generation scenarios. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.
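To make the relationship between safe and unsafe denoised samples concrete, here is a minimal numerical sketch rather than the paper's algorithm. It assumes that the data distribution is a mixture of a safe part and an unsafe part, that each part is an empirical distribution over a handful of 1-D points, and that noising is Gaussian, $x_t = x + \sigma\epsilon$. Under those assumptions the posterior means satisfy $\mathbb{E}_{\text{safe}}[x \mid x_t] = \mathbb{E}_{\text{data}}[x \mid x_t] + \beta(x_t)\,(\mathbb{E}_{\text{data}}[x \mid x_t] - \mathbb{E}_{\text{unsafe}}[x \mid x_t])$, where $\beta(x_t)$ is the ratio of the unsafe to the safe likelihood mass at $x_t$. The paper's exact definition of $\beta^{*}(\mathbf{x}_t)$ and its approximations are not given in this section, so the function names, the toy points, and this form of the weight are all illustrative assumptions.

```python
import math

def kernel_weights(x_t, points, sigma):
    """Unnormalized Gaussian likelihoods p(x_t | x_i) for each datapoint x_i."""
    return [math.exp(-((x_t - x) ** 2) / (2.0 * sigma ** 2)) for x in points]

def denoiser(x_t, points, sigma):
    """Posterior mean E[x | x_t] under an empirical distribution over `points`."""
    w = kernel_weights(x_t, points, sigma)
    return sum(wi * xi for wi, xi in zip(w, points)) / sum(w)

# Hypothetical toy sets: the "negation set" is `unsafe`; the data is their union.
safe = [-2.5, -2.0, -1.8, -1.2]
unsafe = [1.5, 2.0, 2.4]
data = safe + unsafe

sigma = 1.5   # noise level at the current sampling step (assumed)
x_t = 0.5     # current noisy sample (assumed)

d_data = denoiser(x_t, data, sigma)      # what a pretrained model would approximate
d_unsafe = denoiser(x_t, unsafe, sigma)  # computable from the negation set alone
d_safe = denoiser(x_t, safe, sigma)      # ground truth, used here only as a check

# Data-dependent weight: unsafe likelihood mass over safe likelihood mass at x_t.
beta = sum(kernel_weights(x_t, unsafe, sigma)) / sum(kernel_weights(x_t, safe, sigma))

# Safe denoiser recovered from the data and unsafe denoisers alone.
d_safe_from_identity = d_data + beta * (d_data - d_unsafe)

print(d_safe, d_safe_from_identity)
```

In this toy the two printed values agree up to floating-point error, and the corrected estimate moves away from the unsafe points relative to the data denoiser. In a real diffusion model the safe likelihood mass is not directly available, which is presumably why the paper needs practical approximations of the weight.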
