Mitigating Sexual Content Generation via Embedding Distortion in Text-conditioned Diffusion Models
Jaesin Ahn, Heechul Jung
TL;DR
This work tackles the risk of sexual content generation in text-conditioned diffusion models by introducing Distorting Embedding Space (DES), a text-encoder-level defense that distorts unsafe embeddings toward safe regions and neutralizes the nudity concept. DES uses a two-phase approach: (1) target vector generation, which identifies safe anti-nudity directions, and (2) training with three losses that distort the unsafe embedding space, preserve safe semantics, and neutralize nudity, with zero inference overhead and rapid training. Empirically, DES achieves state-of-the-art attack success rates (ASR) of 9.47% on FLUX.1 and 0.52% on Stable Diffusion v1.5 (SDv1.5), substantially lower than prior methods, while maintaining competitive FID and CLIP scores across text-to-image (T2I) and image-to-image (I2I) tasks and generalizing to multiple NSFW concepts. The framework also offers practical deployment advantages, including minimal training time and compatibility with recent diffusion models, making it a robust, scalable defense against adversarial and adaptive attacks in NSFW content generation.
Abstract
Diffusion models show remarkable performance in generating images from text prompts, but they risk generating sexual content. Existing approaches, such as prompt filtering, concept removal, and even dedicated sexual content mitigation methods, struggle to defend against adversarial attacks while maintaining benign image quality. In this paper, we propose Distorting Embedding Space (DES), a text encoder-based defense mechanism that addresses these issues by controlling the embedding space. DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions to prevent unsafe content generation, while reproducing the original safe embeddings. DES also neutralizes the "nudity" embedding by aligning it with a neutral embedding, which enhances robustness against adversarial attacks. Extensive experiments on explicit content mitigation and adaptive attack defense show that DES achieves state-of-the-art (SOTA) defense, with an attack success rate (ASR) of 9.47% on FLUX.1, a recent popular model, and 0.52% on the widely adopted Stable Diffusion v1.5. These correspond to ASR reductions of 76.5% and 63.9% compared to the previous SOTA methods, EraseAnything and AdvUnlearn, respectively. Furthermore, DES maintains benign image quality, achieving Fréchet Inception Distance (FID) and CLIP scores comparable to those of the original FLUX.1 and Stable Diffusion v1.5.
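To make the three training objectives concrete, the following is a minimal sketch of how such losses could be combined. It is not the paper's formulation: the use of mean squared error, the empty prompt as the neutral reference, the loss weights, and all names (`des_style_losses`, `frozen_encoder`, `safe_targets`) are assumptions chosen for illustration, and the encoders are plain callables rather than a real text encoder.

```python
# Illustrative sketch only. The loss forms, the neutral reference, the weights,
# and every name here are assumptions, not the definitions used by DES.
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[str], Vector]  # maps a prompt to an embedding vector


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length, non-empty embeddings."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("embeddings must be non-empty and of equal dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def des_style_losses(
    encoder: Encoder,                 # text encoder being fine-tuned
    frozen_encoder: Encoder,          # frozen copy of the original encoder
    unsafe_prompts: Sequence[str],
    safe_targets: Sequence[Vector],   # one precomputed safe target per unsafe prompt
    safe_prompts: Sequence[str],
    nudity_prompt: str = "nudity",
    neutral_prompt: str = "",         # assumed neutral reference (hypothetical choice)
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Dict[str, float]:
    if len(unsafe_prompts) != len(safe_targets):
        raise ValueError("need exactly one safe target per unsafe prompt")
    if not unsafe_prompts or not safe_prompts:
        raise ValueError("prompt lists must be non-empty")

    # (1) Distort: move each unsafe embedding toward its safe target vector.
    distort = sum(
        mse(encoder(p), t) for p, t in zip(unsafe_prompts, safe_targets)
    ) / len(unsafe_prompts)

    # (2) Preserve: keep safe embeddings close to those of the original encoder.
    preserve = sum(
        mse(encoder(p), frozen_encoder(p)) for p in safe_prompts
    ) / len(safe_prompts)

    # (3) Neutralize: align the "nudity" embedding with the neutral embedding.
    neutralize = mse(encoder(nudity_prompt), frozen_encoder(neutral_prompt))

    w_d, w_p, w_n = weights
    total = w_d * distort + w_p * preserve + w_n * neutralize
    return {
        "distort": distort,
        "preserve": preserve,
        "neutralize": neutralize,
        "total": total,
    }
```

In a real setting, `encoder` and `frozen_encoder` would wrap the trainable and original text encoders, and the scalar arithmetic would be replaced by differentiable tensor operations; the sketch only shows how the three terms relate to the unsafe, safe, and nudity embeddings described above.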
