AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models

Yiming Wang, Jiahao Chen, Qingming Li, Tong Zhang, Rui Zeng, Xing Yang, Shouling Ji

TL;DR

The paper tackles NSFW prompt risks in text-to-image generation by introducing AEIOU, a unified defense that exploits NSFW feature directions in transformer-based text encoders. By extracting head- and layer-specific NSFW directions and projecting prompt representations onto them, AEIOU achieves high detection accuracy while remaining efficient and interpretable. It offers text- and image-based interpretations, supports optimization via red-teaming data augmentation, and demonstrates robustness to unknown and adaptive attacks across multiple encoders and datasets. Practically, AEIOU provides a scalable, real-time defense that can be integrated with existing moderation pipelines without compromising image quality. The work also introduces a multi-category extension and thorough ablations to illuminate the framework’s mechanics and limits.

Abstract

As text-to-image (T2I) models advance and gain widespread adoption, their associated safety concerns are becoming increasingly critical. Malicious users exploit these models to generate Not-Safe-for-Work (NSFW) images using harmful or adversarial prompts, underscoring the need for effective safeguards to ensure the integrity and compliance of model outputs. However, existing detection methods often exhibit low accuracy and inefficiency. In this paper, we propose AEIOU, a defense framework that is adaptable, efficient, interpretable, optimizable, and unified against NSFW prompts in T2I models. AEIOU extracts NSFW features from the hidden states of the model's text encoder, utilizing the separable nature of these features to detect NSFW prompts. The detection process is efficient, requiring minimal inference time. AEIOU also offers real-time interpretation of results and supports optimization through data augmentation techniques. The framework is versatile, accommodating various T2I architectures. Our extensive experiments show that AEIOU significantly outperforms both commercial and open-source moderation tools, achieving over 95% accuracy across all datasets and improving efficiency by at least tenfold. It effectively counters adaptive attacks and excels in few-shot and multi-label scenarios.
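
The idea of detecting NSFW prompts from separable features in the text encoder's hidden states can be illustrated with a small sketch. This is not the authors' implementation: it assumes a difference-of-means estimator for the NSFW direction, a midpoint origin with a zero threshold, and synthetic Gaussian vectors standing in for the hidden states of a single attention head. The names `nsfw_direction`, `projection_scores`, and `make_states` are hypothetical.

```python
import numpy as np


def nsfw_direction(benign: np.ndarray, nsfw: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the benign mean to the NSFW mean.

    Difference of means is an assumed estimator, chosen for simplicity.
    """
    diff = nsfw.mean(axis=0) - benign.mean(axis=0)
    return diff / np.linalg.norm(diff)


def projection_scores(hidden: np.ndarray, direction: np.ndarray,
                      origin: np.ndarray) -> np.ndarray:
    """Scalar projection of each hidden state onto the direction."""
    return (hidden - origin) @ direction


def make_states(rng: np.random.Generator, n: int, dim: int,
                shifted: bool) -> np.ndarray:
    """Synthetic stand-in for hidden states; NSFW ones are shifted on axis 0."""
    states = rng.normal(size=(n, dim))
    if shifted:
        states[:, 0] += 4.0
    return states


rng = np.random.default_rng(0)
dim = 64

# "Training": estimate the direction and a midpoint origin from labeled states.
train_benign = make_states(rng, 200, dim, shifted=False)
train_nsfw = make_states(rng, 200, dim, shifted=True)
direction = nsfw_direction(train_benign, train_nsfw)
origin = (train_benign.mean(axis=0) + train_nsfw.mean(axis=0)) / 2.0

# "Inference": flag a state when its projection exceeds the (assumed) threshold 0.
test_benign = make_states(rng, 200, dim, shifted=False)
test_nsfw = make_states(rng, 200, dim, shifted=True)
flag_benign = projection_scores(test_benign, direction, origin) > 0.0
flag_nsfw = projection_scores(test_nsfw, direction, origin) > 0.0

accuracy = (np.sum(~flag_benign) + np.sum(flag_nsfw)) / 400
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Because detection here reduces to one dot product per direction on states the encoder already computes, a check of this kind adds little work on top of the text encoder's forward pass, which is in line with the efficiency claim above. In a real pipeline the synthetic arrays would be replaced by hidden states read from the T2I model's text encoder.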

Paper Structure

This paper contains 33 sections, 17 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Left: The distribution of output embeddings from the CLIP model. Right: Example 1 corresponds to the prompt "A woman stands on the beach, facing the sea." Example 2 corresponds to the prompt "A naked woman stands on the beach, facing the sea." The images generated from each prompt are shown alongside.
  • Figure 2: PCA maps of hidden states from different layers and different heads.
  • Figure 3: The overall framework of AEIOU. The blue arrows represent the training process, where data enters the text encoder, and AEIOU extracts NSFW features from the hidden states. The green arrows indicate the inference process, where the prompt passes through the text encoder for detection and interpretation; if it passes, image generation proceeds, otherwise, generation is denied and an interpretation is provided. The red arrows indicate the data augmentation process, involving red-teaming tests on AEIOU, where NSFW prompts that successfully bypass detection are added to the training data for data augmentation.
  • Figure 4: Image-based interpretation with Stable Diffusion v1.4. To mitigate potential impact on the reader, we follow the established convention of prior work [yang2024mma, ba2023surrogateprompt] and obscure the NSFW images using both blurring and masking techniques.
  • Figure 5: ROC curves of all methods.
  • ...and 4 more figures