AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models
Yiming Wang, Jiahao Chen, Qingming Li, Tong Zhang, Rui Zeng, Xing Yang, Shouling Ji
TL;DR
The paper tackles NSFW prompt risks in text-to-image generation by introducing AEIOU, a unified defense that exploits NSFW feature directions in transformer-based text encoders. By extracting head- and layer-specific NSFW directions and projecting prompt representations onto them, AEIOU achieves high detection accuracy while remaining efficient and interpretable. It offers text- and image-based interpretations, supports optimization via red-teaming data augmentation, and demonstrates robustness to unknown and adaptive attacks across multiple encoders and datasets. Practically, AEIOU provides a scalable, real-time defense that can be integrated with existing moderation pipelines without compromising image quality. The work also introduces a multi-category extension and thorough ablations to illuminate the framework’s mechanics and limits.
Abstract
As text-to-image (T2I) models advance and gain widespread adoption, their associated safety concerns are becoming increasingly critical. Malicious users exploit these models to generate Not-Safe-for-Work (NSFW) images using harmful or adversarial prompts, underscoring the need for effective safeguards to ensure the integrity and compliance of model outputs. However, existing detection methods are often inaccurate and inefficient. In this paper, we propose AEIOU, a defense framework against NSFW prompts in T2I models that is adaptable, efficient, interpretable, optimizable, and unified. AEIOU extracts NSFW features from the hidden states of the model's text encoder and uses the separability of these features to detect NSFW prompts. The detection process is efficient, requiring minimal inference time. AEIOU also offers real-time interpretation of results and supports optimization through data augmentation techniques. The framework is versatile, accommodating various T2I architectures. Our extensive experiments show that AEIOU significantly outperforms both commercial and open-source moderation tools, achieving over 95% accuracy across all datasets and improving efficiency at least tenfold. It effectively counters adaptive attacks and excels in few-shot and multi-label scenarios.
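The detection idea summarized above (find a direction in the text encoder's hidden-state space along which NSFW and benign prompts separate, then score a new prompt by its projection onto that direction) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a difference-of-means direction, a single direction rather than head- and layer-specific ones, a midpoint origin, a fixed threshold, and toy 2-D vectors standing in for real hidden states. All function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit_direction(nsfw_states, benign_states):
    """Assumed estimator: normalized difference of the two class means."""
    mu_nsfw = mean_vector(nsfw_states)
    mu_benign = mean_vector(benign_states)
    diff = [a - b for a, b in zip(mu_nsfw, mu_benign)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction can be estimated")
    return [d / norm for d in diff]


def projection_score(state, direction, origin):
    """Signed length of (state - origin) projected onto the direction."""
    return sum((s - o) * d for s, o, d in zip(state, origin, direction))


def is_flagged(state, direction, origin, threshold=0.0):
    """Flag a prompt when its projection exceeds the (assumed) threshold."""
    return projection_score(state, direction, origin) > threshold


# Toy stand-ins for hidden states of labeled prompts (hypothetical data).
benign = [[0.1, 0.9], [0.2, 1.1], [0.0, 1.0]]
nsfw = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]

direction = unit_direction(nsfw, benign)
# Assumed origin: midpoint between the two class means.
midpoint = mean_vector([mean_vector(nsfw), mean_vector(benign)])

print(is_flagged([1.1, 1.0], direction, midpoint))
print(is_flagged([0.05, 1.0], direction, midpoint))
```

In a real pipeline the vectors would come from the text encoder's hidden states for each prompt, and the threshold would be tuned on held-out data. Because scoring reduces to a single dot product per direction, a check of this kind adds little to inference time, in line with the efficiency claim above.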
