Training-Free Safe Denoisers for Safe Use of Diffusion Models

Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park

TL;DR

The paper tackles safety in diffusion models by introducing a training-free safe denoiser that modifies the sampling trajectory to avoid unsafe data regions without retraining. It derives a theoretical link between the data denoiser, the safe denoiser, and the unsafe denoiser via a data-dependent weight $\beta^{*}(\mathbf{x}_t)$, and provides pragmatic approximations to implement the safe denoiser efficiently. The method integrates with existing text-based safety mechanisms (CFG, SAFREE, SLD) and demonstrates substantial improvements in NSFW and IP-related safety metrics across text-to-image and conditional/unconditional generation, with modest computational overhead. The approach acts as defense-in-depth during inference, enabling safer diffusion-based generation while preserving quality and alignment with prompts, and it shows strong compatibility with frontier models and IP-control tasks. Overall, the work offers a scalable, training-free safety mechanism that can be layered onto current safety pipelines to mitigate adversarial prompts and memorization risks in diffusion models.

Abstract

There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content or generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by heavily relying on text-based negative prompts or extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach, directly modifying the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints needed to be excluded) to avoid specific regions of data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not safe, leading to our $\textit{safe}$ denoiser which ensures its final samples are away from the area to be negated. Inspired by the derivation, we develop a practical algorithm that successfully produces high-quality samples while avoiding negation areas of the data distribution in text-conditional, class-conditional, and unconditional image generation scenarios. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.
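
The relationship between the data, safe, and unsafe denoisers can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a hypothetical 1D data distribution that is a two-component Gaussian mixture (one "safe" and one "unsafe" component), additive Gaussian noising $x_t = x + \sigma_t \epsilon$, and a weight `beta` defined as the ratio of the unsafe to safe posterior responsibilities at $x_t$. Under those assumptions the safe denoiser can be recovered from the data denoiser by moving away from the unsafe denoiser. All names and parameter values are illustrative.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gaussian_denoiser(x_t, mean, var, sigma_t):
    """E[x | x_t] for x ~ N(mean, var) and x_t = x + sigma_t * eps."""
    return mean + var / (var + sigma_t ** 2) * (x_t - mean)

# Hypothetical toy parameters (assumptions, not values from the paper).
PI_UNSAFE = 0.2                      # prior mass of the unsafe component
SAFE_MEAN, SAFE_VAR = -2.0, 0.5
UNSAFE_MEAN, UNSAFE_VAR = 2.0, 0.3

def toy_denoisers(x_t, sigma_t):
    """Return (E_data, E_safe, E_unsafe, beta) at x_t for the toy mixture."""
    # Prior-weighted densities of each noised component at x_t.
    w_safe = (1.0 - PI_UNSAFE) * gauss_pdf(x_t, SAFE_MEAN, SAFE_VAR + sigma_t ** 2)
    w_unsafe = PI_UNSAFE * gauss_pdf(x_t, UNSAFE_MEAN, UNSAFE_VAR + sigma_t ** 2)
    e_safe = gaussian_denoiser(x_t, SAFE_MEAN, SAFE_VAR, sigma_t)
    e_unsafe = gaussian_denoiser(x_t, UNSAFE_MEAN, UNSAFE_VAR, sigma_t)
    # The data denoiser is the responsibility-weighted average of the two.
    e_data = (w_safe * e_safe + w_unsafe * e_unsafe) / (w_safe + w_unsafe)
    # Data-dependent weight: grows as x_t approaches the unsafe region.
    beta = w_unsafe / w_safe
    return e_data, e_safe, e_unsafe, beta

def safe_from_data(e_data, e_unsafe, beta):
    """Push the data denoiser away from the unsafe denoiser."""
    return e_data + beta * (e_data - e_unsafe)

if __name__ == "__main__":
    for x_t in (-3.0, 0.0, 3.0):
        e_data, e_safe, e_unsafe, beta = toy_denoisers(x_t, sigma_t=1.0)
        print(x_t, e_safe, safe_from_data(e_data, e_unsafe, beta))
```

In this toy setting the two printed values agree for every `x_t`, because the data denoiser is a convex combination of the safe and unsafe denoisers. With a real diffusion model the safe density is not available in closed form, which is why the weight has to be approximated in practice.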

Paper Structure

This paper contains 41 sections, 20 equations, 18 figures, 9 tables, 1 algorithm.

Figures (18)

  • Figure 1: Our method, Safe Denoiser, compared with existing methods. (a) Our method, combined with SAFREE (Yoon et al., 2024) and SLD (Schramowski et al., 2023), does not generate inappropriate images. (b) Our method mitigates the memorization issue by negating the real image, resulting in a novel image that shares features with the real image, such as hair color or outfit.
  • Figure 2: An overview of the safe denoiser. (a) The safe denoiser $\mathbb{E}_{\text{safe}}$ negates the direction of the unsafe denoiser $\mathbb{E}_{\text{unsafe}}$ from the data denoiser $\mathbb{E}_{\text{data}}$. (b) Trajectories from data denoiser and safe denoiser, starting from the same initial point far from the data distribution, reveal distinct paths: while the sample path from the data denoiser falls into the unsafe region, the trajectory from the safe denoiser successfully avoids it.
  • Figure 3: Effect of the weight value from the safe-denoiser theorem. (a) If we use half the theoretical weight value, samples generated by our weak safe denoiser also cover the unsafe region (i.e., red dots appearing in the blue area). (b) When we use the theoretical value, the samples avoid unsafe regions while covering the whole safe area. (c) If we penalize more with a doubled weight value, the samples not only avoid the unsafe data but also negate the neighborhood of unsafe data (i.e., there are no red dots in the black area).
  • Figure 4: Ablation studies of (a) the effect of the number of unsafe data points ($N$) and (b) the effect of the threshold ($\beta_{t}$).
  • Figure 5: Qualitative result for style-level intellectual property control. SD-v1.4 reproduces Munch’s style, whereas our method, with and without SAFREE, removes that style while preserving the "Barbie" concept. In this experiment, we use four variants of The Scream, painted in 1893, 1893, 1895, and 1910, as the negative datapoints.
  • ...and 13 more figures
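
The captions of Figures 3 and 4 refer to a weight value and to a number $N$ of unsafe data points. As a rough illustration of how those two knobs could interact, the sketch below assumes the negation set is treated as an empirical distribution under Gaussian noising $x_t = x + \sigma_t \epsilon$, so that the unsafe denoiser becomes a softmax-weighted average of the negation points. The data-denoiser output `e_data`, the weight `beta`, and the `scale` multiplier (0.5, 1, 2, mirroring the half, theoretical, and doubled settings in Figure 3) are placeholder values, and the function names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def empirical_unsafe_denoiser(x_t, negation_set, sigma_t):
    """E[x | x_t] when x is uniform over negation_set and x_t = x + sigma_t * eps.

    negation_set has shape (N, D); x_t has shape (D,).
    """
    sq_dist = np.sum((negation_set - x_t) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma_t ** 2)
    logits -= logits.max()            # stabilise the softmax numerically
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights @ negation_set     # convex combination of negation points

def corrected_denoiser(e_data, e_unsafe, beta, scale=1.0):
    """Move the data denoiser away from the unsafe denoiser by scale * beta."""
    return e_data + scale * beta * (e_data - e_unsafe)

if __name__ == "__main__":
    # Hypothetical 2D negation set with N = 3 points.
    negation_set = np.array([[2.0, 2.0], [2.2, 1.8], [1.8, 2.1]])
    x_t = np.array([1.0, 1.0])
    e_data = np.array([1.5, 1.5])     # placeholder for a pretrained denoiser's output
    beta = 0.4                        # placeholder weight, not a value from the paper
    e_unsafe = empirical_unsafe_denoiser(x_t, negation_set, sigma_t=1.0)
    for scale in (0.5, 1.0, 2.0):
        print(scale, corrected_denoiser(e_data, e_unsafe, beta, scale))
```

A larger `scale` pushes the corrected estimate further from the negation set, which is consistent with the qualitative behavior described for panel (c) of Figure 3, where the neighborhood of the unsafe data is also avoided.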

Theorems & Definitions (1)

  • proof