SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models

Peigui Qi, Kunsheng Tang, Wenbo Zhou, Weiming Zhang, Nenghai Yu, Tianwei Zhang, Qing Guo, Jie Zhang

TL;DR

SafeGuider addresses the vulnerability of text-to-image models to adversarial prompts by exploiting the [EOS] token as a semantic aggregator. It introduces a two-step, embedding-level safety approach: Step I uses an [EOS]-based recognizer to detect unsafe prompts, and Step II applies a Safety-Aware Feature Erasure beam search to steer unsafe prompts toward safe, semantically meaningful embeddings for image generation. The framework is robust against in-domain and out-of-domain attacks (attack success rate, ASR, of at most 5.48% across attack scenarios), preserves generation quality for benign prompts (GSR ≈100%), and transfers to the SD-V2.1 and Flux architectures, including resistance to adaptive attacks. These results suggest that SafeGuider offers a practical, architecture-agnostic pathway to secure, real-world text-to-image deployments without resorting to blunt generation refusal or semantic distortion of safe prompts.

Abstract

Text-to-image models have shown remarkable capabilities in generating high-quality images from natural language descriptions. However, these models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. Despite various defensive strategies, achieving robustness against attacks while maintaining practical utility in real-world applications remains a significant challenge. To address this issue, we first conduct an empirical study of the text encoder in the Stable Diffusion (SD) model, which is a widely used and representative text-to-image model. Our findings reveal that the [EOS] token acts as a semantic aggregator, exhibiting distinct distributional patterns between benign and adversarial prompts in its embedding space. Building on this insight, we introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider combines an embedding-level recognition model with a safety-aware feature erasure beam search algorithm. This integration enables the framework to maintain high-quality image generation for benign prompts while ensuring robust defense against both in-domain and out-of-domain attacks. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios. Moreover, instead of refusing to generate or producing black images for unsafe prompts, SafeGuider generates safe and meaningful images, enhancing its practical utility. In addition, SafeGuider is not limited to the SD model and can be effectively applied to other text-to-image models, such as the Flux model, demonstrating its versatility and adaptability across different architectures. We hope that SafeGuider can shed some light on the practical deployment of secure text-to-image systems.
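
The two-step flow described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the 2-D toy token vectors, the mean-pooled `eos_embedding` standing in for the text encoder's [EOS] embedding, the logistic `safety_score` standing in for the trained recognizer, the choice to erase features by dropping tokens, the ranking by safety plus cosine similarity, and the names `guide`, `threshold`, `beam_width`, and `max_removals`. It also returns the surviving tokens for readability, whereas the abstract describes control at the embedding level.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, float]

# Hypothetical token vectors: axis 0 ~ "unsafe" feature, axis 1 ~ benign content.
TOY_VECTORS: Dict[str, Vec] = {
    "a": (0.0, 0.1),
    "person": (0.0, 1.0),
    "walking": (0.0, 0.8),
    "gory": (2.0, 0.1),
}


def eos_embedding(tokens: Sequence[str]) -> Vec:
    """Toy stand-in for the [EOS] embedding: the mean of the token vectors."""
    if not tokens:
        return (0.0, 0.0)
    vecs = [TOY_VECTORS.get(t, (0.0, 0.0)) for t in tokens]
    n = len(vecs)
    return (sum(v[0] for v in vecs) / n, sum(v[1] for v in vecs) / n)


def safety_score(emb: Vec) -> float:
    """Toy stand-in for the recognizer: a score in (0, 1), higher means safer."""
    return 1.0 / (1.0 + math.exp(4.0 * emb[0] - 1.0))


def cosine(u: Vec, v: Vec) -> float:
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)


def guide(tokens: Sequence[str], threshold: float = 0.5,
          beam_width: int = 2, max_removals: int = 3) -> List[str]:
    """Step I: recognize. Step II: beam-search token removals toward safety."""
    original = eos_embedding(tokens)

    # Step I: prompts the recognizer accepts pass through unchanged.
    if safety_score(original) >= threshold:
        return list(tokens)

    def rank(cand: List[str]) -> float:
        # Reward safety and similarity to the original embedding together.
        emb = eos_embedding(cand)
        return safety_score(emb) + cosine(emb, original)

    # Step II: each round drops one more token from every beam entry.
    beam: List[List[str]] = [list(tokens)]
    for _ in range(max_removals):
        candidates: List[List[str]] = []
        for cand in beam:
            for i in range(len(cand)):
                reduced = cand[:i] + cand[i + 1:]
                if reduced and reduced not in candidates:
                    candidates.append(reduced)
        if not candidates:
            break
        candidates.sort(key=rank, reverse=True)
        beam = candidates[:beam_width]
        safe = [c for c in beam if safety_score(eos_embedding(c)) >= threshold]
        if safe:
            return safe[0]  # best-ranked candidate that the recognizer accepts
    return beam[0]  # fall back to the best-ranked candidate found


if __name__ == "__main__":
    print(guide(["a", "person", "walking"]))
    print(guide(["a", "gory", "person", "walking"]))
```

In this toy setup a benign prompt is returned untouched, which mirrors the goal of leaving benign generation unaffected, while a flagged prompt loses only the token that drives the unsafe score, so the remaining content can still condition a meaningful image rather than a refusal.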

Paper Structure

This paper contains 38 sections, 6 equations, 14 figures, and 11 tables.

Figures (14)

  • Figure 1: Examples of adversarial attacks on Stable Diffusion models. 1) Vocabulary substitution (blue): replacing explicit terms with innocuous ones. 2) Symbol injection (orange): adding adversarial symbols to generate unsafe content.
  • Figure 2: Examples of defenses implemented on SD-V1.4 against out-of-domain adversarial attacks. Both attacks successfully circumvent all defenses, revealing robustness challenges.
  • Figure 3: Practical impact of defenses on SD-V1.4. Current defenses either compromise the semantics of benign generation (internal) or refuse to generate (external), revealing practicality challenges. Benign prompt: "There is a woman walking the sidewalk".
  • Figure 4: Illustration of the generation pipeline of the Stable Diffusion model.
  • Figure 5: Attention visualization in SD-V1.4's text encoder. Lines show attention flows from input tokens (right) to the [EOS] token (lower-left corner). Colors denote attention heads and line thickness shows attention weights. The [EOS] token's consistent attention to all tokens across layers reveals its role as a condition feature aggregator.
  • ...and 9 more figures