Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models
Byeonghu Na, Mina Kang, Jiseok Kwak, Minsang Park, Jiwoo Shin, SeJoon Jun, Gayoung Lee, Jin-Hwa Kim, Il-Chul Moon
TL;DR
This work addresses unsafe outputs from text-to-image diffusion models trained on large, web-scraped datasets. It introduces Safe Text Embedding Guidance (STG), a training-free approach that adjusts the text embeddings during sampling according to a safety function evaluated on the expected final denoised image, which is estimated from the intermediate diffusion state. This steers outputs toward safety with minimal loss of quality. The authors show theoretically that STG aligns the underlying model distribution with safety constraints, and they compare STG with Safe Data Guidance (SDG) and other baselines on nudity, violence, and artist-style removal. Empirically, STG improves safety across backbones and samplers, and the strength of the guidance can be controlled through the update scale ρ and related hyperparameters. Because it requires no retraining of the diffusion model, STG can be adapted to different safety criteria. The code is released for reproducibility.
Abstract
Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text Embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at https://github.com/aailab-kaist/STG.
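The guidance step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the repository linked above is the reference for that). The noise predictor `toy_eps`, the safety score `toy_safety`, the helper `guide_embedding`, and the plain gradient-ascent update with scale `rho` are all assumptions made for illustration, and finite differences stand in for the automatic differentiation a real pipeline would likely use. The expected final image is computed with the standard DDPM estimate from the noisy state and the predicted noise.

```python
import math


def toy_eps(x_t, c):
    """Hypothetical noise predictor standing in for a text-conditioned denoiser."""
    # Assumption: predicted noise is linear in the noisy state and the embedding.
    return [0.5 * x - 0.3 * ci for x, ci in zip(x_t, c)]


def expected_x0(x_t, c, alpha_bar):
    """Expected final denoised image via the standard DDPM estimate:
    x0_hat = (x_t - sqrt(1 - alpha_bar) * eps) / sqrt(alpha_bar)."""
    eps = toy_eps(x_t, c)
    s, r = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - r * e) / s for x, e in zip(x_t, eps)]


def toy_safety(x0):
    """Hypothetical safety score (higher is safer): penalizes coordinate 0."""
    return -(x0[0] ** 2)


def safety_of_embedding(x_t, c, alpha_bar):
    """Safety score of the expected final image, viewed as a function of c."""
    return toy_safety(expected_x0(x_t, c, alpha_bar))


def guide_embedding(x_t, c, alpha_bar, rho, h=1e-5):
    """One illustrative guidance step: c <- c + rho * d(safety)/dc.

    The gradient is approximated with central finite differences.
    """
    grad = []
    for i in range(len(c)):
        up, down = list(c), list(c)
        up[i] += h
        down[i] -= h
        diff = safety_of_embedding(x_t, up, alpha_bar) - safety_of_embedding(
            x_t, down, alpha_bar
        )
        grad.append(diff / (2.0 * h))
    return [ci + rho * g for ci, g in zip(c, grad)]


# Toy usage: a two-dimensional "image" and a two-dimensional "text embedding".
x_t = [1.0, -0.5]
c = [0.2, 0.4]
alpha_bar, rho = 0.5, 0.5

before = safety_of_embedding(x_t, c, alpha_bar)
c_new = guide_embedding(x_t, c, alpha_bar, rho)
after = safety_of_embedding(x_t, c_new, alpha_bar)
print(f"safety before: {before:.4f}, after: {after:.4f}")
```

In this toy setup only the embedding `c` is modified and the noisy state `x_t` is left untouched. The scale `rho` plays the role of the update scale ρ mentioned in the TL;DR: larger values move the embedding further per step.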
