Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models

Byeonghu Na, Mina Kang, Jiseok Kwak, Minsang Park, Jiwoo Shin, SeJoon Jun, Gayoung Lee, Jin-Hwa Kim, Il-Chul Moon

TL;DR

This work tackles unsafe outputs from text-to-image diffusion models trained on large, web-scraped data. It introduces Safe Text Embedding Guidance (STG), a training-free approach that shifts the text embeddings during sampling using a safety function derived from intermediate diffusion states, steering outputs toward safety with minimal loss of quality. The authors show theoretically that STG aligns the underlying model distribution with safety constraints, and they compare STG with Safe Data Guidance (SDG) and other baselines on nudity, violence, and artist-style removal. Empirically, STG improves safety across backbones and samplers, with control via the update scale ρ and related hyperparameters, and the code is released for reproducibility. Because it requires no retraining of the diffusion model, STG can be adapted to different safety criteria.

Abstract

Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at https://github.com/aailab-kaist/STG.
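The mechanism described in the abstract (nudging the text embedding along the gradient of a safety function evaluated on the expected denoised image) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_eps_model`, `toy_safety`, the finite-difference gradient, the additive update `c + rho * grad`, and all numeric values are assumptions chosen only to make the idea concrete in two dimensions.

```python
import math

def toy_eps_model(x_t, c, alpha_t, sigma_t):
    # Hypothetical noise predictor: it "believes" the clean image is tanh(c).
    return [(x - alpha_t * math.tanh(ci)) / sigma_t for x, ci in zip(x_t, c)]

def predict_x0(x_t, c, alpha_t, sigma_t):
    # Expected denoised image from the noise prediction:
    # x0_hat = (x_t - sigma_t * eps) / alpha_t
    eps = toy_eps_model(x_t, c, alpha_t, sigma_t)
    return [(x - sigma_t * e) / alpha_t for x, e in zip(x_t, eps)]

def toy_safety(x0):
    # Hypothetical safety score (higher is safer): coordinate 0 stands in
    # for "unsafe content", so the score penalizes its magnitude.
    return -(x0[0] ** 2)

def embedding_gradient(f, c, h=1e-5):
    # Central finite differences; a real system would use autodiff instead.
    grad = []
    for i in range(len(c)):
        up, dn = list(c), list(c)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

def guided_embedding_step(x_t, c, rho, alpha_t, sigma_t):
    # One assumed guidance step: move c uphill on the safety score of x0_hat.
    def safety_of(c_candidate):
        return toy_safety(predict_x0(x_t, c_candidate, alpha_t, sigma_t))
    grad = embedding_gradient(safety_of, c)
    return [ci + rho * gi for ci, gi in zip(c, grad)]

x_t = [0.3, -0.2]            # toy intermediate diffusion state
c = [1.0, 0.5]               # toy text embedding
alpha_t, sigma_t = 0.8, 0.6  # toy noise-schedule values

c_new = guided_embedding_step(x_t, c, rho=0.5, alpha_t=alpha_t, sigma_t=sigma_t)
before = toy_safety(predict_x0(x_t, c, alpha_t, sigma_t))
after = toy_safety(predict_x0(x_t, c_new, alpha_t, sigma_t))
print(c_new, before, after)
```

In this toy setup only the embedding coordinate that influences the "unsafe" direction is changed, while the other coordinate is left alone, which loosely mirrors the stated goal of removing unsafe content while preserving the prompt's remaining semantics.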

Paper Structure

This paper contains 34 sections, 2 theorems, 20 equations, 15 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let $q_t({\mathbf{x}}_t|{\mathbf{c}})$ be the text-conditional distribution at diffusion timestep $t$, and $g_t({\mathbf{x}}_t, {\mathbf{c}})$ be a time-dependent safety function at $t$. If the text embedding ${\mathbf{c}}$ is updated using STG with the step size $\rho$, then the resulting score fun

Figures (15)

  • Figure 1: Overview of safe generation methods for diffusion models. Red borders indicate the components used in the safety methods. Training-based approaches fine-tune the diffusion model using additional resources and do not use samples at test time. Previous training-free methods [yoon2024safree] adjust text embeddings independently of the diffusion model. Our training-free method directly guides text embeddings using the diffusion model and its intermediate images, ensuring safer outputs without additional training. For publication purposes, the generated images are masked and blurred.
  • Figure 2: Generated samples from the 2D toy example with condition ${\mathbf{c}}=(1,0)$ using SDG and STG with different safety functions. $g^*$ is the ideal safety function, proportional to the true safe distribution $p(o=1|{\mathbf{x}}_0)$, while the approximated safety function $\tilde{g}$ preserves relative order but differs in shape. The blue dots represent samples from the true safe conditional distribution $q({\mathbf{x}}_0|{\mathbf{c}},o=1)$, and the green dots indicate instances generated using each guidance method. The background heatmap shows the contours of the respective safety functions. The value in parentheses in each figure title indicates the KL divergence between the true safe conditional distribution and the generated samples.
  • Figure 2: Results for generation quality on the COCO dataset across various safe generation methods applied for nudity removal.
  • Figure 3: Trade-off between defense success rate and prior preservation on nudity and violence. Each experiment is repeated three times with different random seeds, and the mean values are shown as points while the standard deviations are indicated by error bars.
  • Figure 4: Generated images from STG and other safe generation baselines for nudity and artist-style removal scenarios. For publication purposes, the generated images are masked.
  • ...and 10 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof