Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding
Huming Qiu, Guanxu Chen, Mi Zhang, Xiaohan Zhang, Xiaoyu You, Min Yang
TL;DR
This work tackles unsafe content generation in text-to-image models by identifying the prompt embedding as a key source of risk. It introduces Embedding Sanitizer (ES), a vision-agnostic, interpretable, plug-and-play framework that assigns each token a safety score and sanitizes the prompt embedding by mapping toxic tokens toward safe anchor concepts. Trained on synthetic data built from target-anchor token pairs, ES achieves state-of-the-art robustness against handcrafted and adversarial prompts while preserving image fidelity and semantic alignment. It is compatible with existing safeguards and model variants, and its interpretability, together with the control offered by the hyperparameter $\alpha$, makes it practical for real-world safe generation deployments. Overall, ES advances safe generation by addressing text-level vulnerabilities with a scalable, modular defense that complements existing safeguards.
Abstract
In recent years, text-to-image (T2I) generation models have made significant progress in generating high-quality images that align with text descriptions. However, these models also face the risk of unsafe generation, potentially producing harmful content that violates usage policies, such as explicit material. Existing safe generation methods typically focus on suppressing inappropriate content by erasing undesired concepts from visual representations, while neglecting to sanitize the textual representation. Although these methods help mitigate the risk of misuse to some extent, their robustness remains insufficient when dealing with adversarial attacks. Given that semantic consistency between input text and output image is a core requirement of T2I models, we identify that textual representations are likely the primary source of unsafe generation. To this end, we propose Embedding Sanitizer (ES), which enhances the safety of T2I models by sanitizing inappropriate concepts in prompt embeddings. To our knowledge, ES is the first interpretable safe generation framework that assigns a score to each token in the prompt to indicate its potential harmfulness. In addition, ES adopts a plug-and-play modular design, offering compatibility for seamless integration with various T2I models and other safeguards. Evaluations on five prompt benchmarks show that ES outperforms eleven existing safeguard baselines, achieving state-of-the-art robustness while maintaining high-quality image generation.
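To make the idea of token-level sanitization concrete, here is a minimal sketch of one way a scored prompt embedding could be pulled toward a safe anchor. It is not the ES implementation: the abstract does not specify the scoring network or the update rule, so the function name `sanitize_embedding`, the use of a single shared `anchor` vector, and the linear interpolation weighted by `alpha * score` are all assumptions made for illustration. The per-token `scores` are taken as given (in ES they would come from the learned sanitizer).

```python
from typing import List, Sequence

Vector = List[float]


def sanitize_embedding(
    token_embeddings: Sequence[Sequence[float]],
    scores: Sequence[float],
    anchor: Sequence[float],
    alpha: float = 1.0,
) -> List[Vector]:
    """Move each token embedding toward `anchor` in proportion to its score.

    Hypothetical rule (an assumption, not taken from the paper):
        w_i  = clip(alpha * score_i, 0, 1)
        e_i' = (1 - w_i) * e_i + w_i * anchor
    """
    if len(token_embeddings) != len(scores):
        raise ValueError("need exactly one score per token")

    sanitized: List[Vector] = []
    for emb, score in zip(token_embeddings, scores):
        if len(emb) != len(anchor):
            raise ValueError("anchor and token embeddings must share a dimension")
        # Interpolation weight: 0 leaves the token untouched, 1 replaces it
        # with the anchor. alpha scales how aggressively scores are applied.
        w = min(1.0, max(0.0, alpha * score))
        sanitized.append([(1.0 - w) * e + w * a for e, a in zip(emb, anchor)])
    return sanitized


# Toy example: two 2-d token embeddings, the second flagged as harmful.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
scores = [0.0, 1.0]
anchor = [0.5, 0.5]
half_strength = sanitize_embedding(embeddings, scores, anchor, alpha=0.5)
```

Under this assumed rule, a token with score 0 passes through unchanged, and `alpha = 0` disables sanitization entirely, which mirrors the controllability that the TL;DR attributes to $\alpha$. The per-token scores double as the interpretable signal: they indicate which tokens were judged potentially harmful.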
