SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

Jaehong Yoon, Shoubin Yu, Vaidehi Patil, Huaxiu Yao, Mohit Bansal

TL;DR

SAFREE addresses the risk of unsafe content in diffusion-based text-to-image and video generation by delivering a training-free, adaptive safeguarding approach that does not modify model weights. It identifies an unsafe concept subspace in text embeddings and orthogonally projects risky prompt tokens, complemented by self-validating denoising control and Fourier-domain latent re-attention to suppress unsafe signals at the pixel level. The method achieves state-of-the-art safety performance among training-free baselines on several T2I benchmarks and generalizes to T2V backbones, with competitive results relative to training-based safeguards. Its architecture-agnostic, plug-and-play design enables broad applicability across diverse diffusion backbones, offering a scalable, practical solution for safer visual generation.

Abstract

Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) they cannot instantly remove harmful concepts without training; (2) their safe generation capabilities depend on collected training data; (3) they alter model weights, risking degradation in quality for content unrelated to toxic concepts. To address these challenges, we propose SAFREE, a novel training-free approach for safe T2I and T2V generation that does not alter the model's weights. Specifically, we detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace, thereby filtering out harmful content while preserving intended semantics. To balance the trade-off between filtering toxicity and preserving safe concepts, SAFREE incorporates a novel self-validating filtering mechanism that dynamically adjusts the denoising steps when applying the filtered embeddings. Additionally, we incorporate adaptive re-attention mechanisms within the diffusion latent space to selectively diminish the influence of features related to toxic concepts at the pixel level. Together, these components give SAFREE coherent safety checking that preserves the fidelity, quality, and safety of the output. SAFREE achieves SOTA performance in suppressing unsafe content in T2I generation compared to training-free baselines and effectively filters targeted concepts while maintaining high-quality images. It also shows competitive results against training-based methods. We extend SAFREE to various T2I backbones and T2V tasks, showcasing its flexibility and generalization. SAFREE provides a robust and adaptable safeguard for ensuring safe visual generation.
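
The idea of steering prompt embeddings away from a toxic concept subspace can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`orthonormal_basis`, `distance_to_subspace`, `project_away`), the use of an SVD to build the basis, and the boolean `unsafe_mask` are all assumptions made for illustration, and the sketch covers only the orthogonal projection step. It omits the token detection, the constraint that projected tokens stay in the input space, the self-validating filtering, and the latent re-attention.

```python
import numpy as np


def orthonormal_basis(concept_embeddings: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Return an orthonormal basis (as columns) for the span of the concept embeddings.

    concept_embeddings has shape (k, d): k toxic-concept embeddings of dimension d.
    Using an SVD here is an illustrative choice, not taken from the paper.
    """
    _, s, vt = np.linalg.svd(concept_embeddings, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))  # drop numerically zero directions
    return vt[:rank].T  # shape (d, rank)


def distance_to_subspace(x: np.ndarray, basis: np.ndarray) -> float:
    """Euclidean distance from a single embedding x (shape (d,)) to the subspace."""
    residual = x - basis @ (basis.T @ x)
    return float(np.linalg.norm(residual))


def project_away(tokens: np.ndarray, basis: np.ndarray, unsafe_mask: np.ndarray) -> np.ndarray:
    """Remove the subspace component from tokens flagged as unsafe.

    tokens: (n, d) token embeddings; unsafe_mask: (n,) booleans (hypothetical
    output of some upstream detection step). Unflagged tokens are returned unchanged.
    """
    projected = tokens - (tokens @ basis) @ basis.T
    return np.where(unsafe_mask[:, None], projected, tokens)


# Toy example in 4 dimensions: the "toxic" subspace is spanned by the first two axes.
concepts = np.array([[1.0, 0.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0, 0.0]])
basis = orthonormal_basis(concepts)

tokens = np.array([[1.0, 2.0, 3.0, 4.0],
                   [0.0, 0.0, 5.0, 6.0]])
mask = np.array([True, False])

filtered = project_away(tokens, basis, mask)
```

In this toy setting the flagged token loses its first two coordinates and keeps the rest, while the unflagged token passes through untouched. Projecting only flagged tokens is one simple way to limit the effect on the safe parts of a prompt.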

Paper Structure

This paper contains 26 sections, 8 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: We present SAFREE, an adaptive, training-free method for T2I that filters out a variety of user-defined concepts. SAFREE enables safe and faithful generation: it removes toxic concepts and creates a safer version of inappropriate prompts without requiring any model updates. SAFREE is also versatile and adaptable, enabling its application to other backbones (such as Diffusion Transformer) and across different applications (like T2V) for enhanced safe generation. Fire icon: training/editing-based methods that alter model weights. Snowflake icon: training-free methods with no weight updates. We manually masked/blurred sensitive text prompts and generated results for display purposes.
  • Figure 2: SAFREE framework. Based on proximity analysis between the masked token embeddings and the toxic subspace $\mathcal{C}$, we detect unsafe tokens and project them to be orthogonal to the toxic concept subspace (in red) while keeping them within the input space $\mathcal{I}$ (in green). SAFREE adaptively controls the filtering strength in an input-dependent manner, which also regulates a latent-level re-attention mechanism. Note that our approach can be broadly applied to various image and video diffusion backbones.
  • Figure 3: Generated examples of SAFREE and safe T2I baselines. Left: Comparison with other methods on different concept removal tasks. Right: SAFREE integrated with different T2I and T2V models. We provide more visualizations in the appendix.
  • Figure 4: Left: Correlation between the toxicity score (predicted by the NudeNet detector) and the distance to the subspace of the nudity concept. Right: Gaussian distributions of the distance between the nudity subspace and the text embeddings of Ring-A-Bell or COCO 30k prompts.
  • Figure 5: Visualization of concept removal for famous artist styles. Each row from top to bottom represents generated artworks of Van Gogh, Pablo Picasso, Rembrandt, Andy Warhol, and Caravaggio with corresponding text prompts, where we remove only Van Gogh's art style (i.e., the first row).
  • ...and 15 more figures