Mitigating Sexual Content Generation via Embedding Distortion in Text-conditioned Diffusion Models

Jaesin Ahn, Heechul Jung

TL;DR

This work tackles the risk of sexual content generation in text-conditioned diffusion models by introducing Distorting Embedding Space (DES), a text-encoder-level defense that distorts unsafe embeddings toward safe regions and neutralizes the nudity concept. DES works in two phases: (1) target vector generation, which identifies safe anti-nudity directions, and (2) training with three losses that distort the unsafe space, preserve safe semantics, and neutralize nudity. It adds zero inference overhead and trains rapidly. Empirically, DES achieves state-of-the-art attack success rates (ASR) of 9.47% on FLUX.1 and 0.52% on SDv1.5, which are ASR reductions of 76.5% and 63.9% compared to the previous best methods, EraseAnything and AdvUnlearn, respectively. It does so while maintaining competitive FID and CLIP scores across T2I and I2I tasks and generalizing to multiple NSFW concepts. The framework offers practical deployment advantages, including minimal training time and compatibility with recent diffusion models, and provides a robust, scalable defense against adversarial and adaptive attacks in NSFW content generation.

Abstract

Diffusion models show remarkable image generation performance following text prompts, but risk generating sexual content. Existing approaches, such as prompt filtering, concept removal, and even sexual content mitigation methods, struggle to defend against adversarial attacks while maintaining benign image quality. In this paper, we propose a novel approach called Distorting Embedding Space (DES), a text encoder-based defense mechanism that effectively tackles these issues through innovative embedding space control. DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions to prevent unsafe content generation, while reproducing the original safe embeddings. DES also neutralizes the "nudity" embedding by aligning it with a neutral embedding to enhance robustness against adversarial attacks. Extensive experiments on explicit content mitigation and adaptive attack defense show that DES achieves state-of-the-art (SOTA) defense, with an attack success rate (ASR) of 9.47% on FLUX.1, a recent, popular model, and 0.52% on the widely adopted Stable Diffusion v1.5. These correspond to ASR reductions of 76.5% and 63.9% compared to the previous SOTA methods, EraseAnything and AdvUnlearn, respectively. Furthermore, DES maintains benign image quality, achieving Fréchet Inception Distance and CLIP scores comparable to those of the original FLUX.1 and Stable Diffusion v1.5.
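
The three training objectives described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the alignment measure (one minus cosine similarity), the equal weighting of the three terms, and all function and argument names are assumptions made for illustration, and embeddings are plain lists of floats rather than text-encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def align_loss(a, b):
    """Assumed alignment measure: 0 when a and b point the same way, 2 when opposite."""
    return 1.0 - cosine(a, b)


def des_style_loss(unsafe_new, target, safe_new, safe_orig, nudity_new, neutral):
    """Sum of three hypothetical terms mirroring the abstract's description."""
    # Pull the unsafe embedding from the trained encoder toward its safe target.
    distort = align_loss(unsafe_new, target)
    # Keep the safe embedding close to what the original encoder produced.
    preserve = align_loss(safe_new, safe_orig)
    # Align the "nudity" embedding with a neutral embedding.
    neutralize = align_loss(nudity_new, neutral)
    return distort + preserve + neutralize
```

In this sketch the total is zero only when all three alignments hold at once, which reflects the stated goal of suppressing unsafe generations without degrading benign ones.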

Paper Structure

This paper contains 51 sections, 7 equations, 19 figures, 22 tables, 2 algorithms.

Figures (19)

  • Figure 1: Performance comparison and conceptual diagram of our approach. (a) The proposed approach offers the best performance in ASR and FID, while also being cost-effective in training. The relative circle sizes indicate training time. ASRs are averaged over multiple sets of unsafe prompts, such as Sneaky, MMA, I2P, Ring-A-Bell, and P4D. (b) Our approach distorts the unsafe embedding space by transforming unsafe embeddings into safe regions, ensuring that embeddings derived from unsafe or adversarial prompts result in benign content generation.
  • Figure 2: Overview of the DES framework. During the target vector generation phase, DES searches for safe-unsafe vector pairs and creates target vectors by subtracting the "nudity" direction from the minimum-similarity safe vectors. In the training phase, DES aligns unsafe vectors with target vectors and maintains safe vectors by aligning both their current and nudity-integrated states with the originals. It also aligns the "nudity" vector with a neutral vector, removing its semantics. Here, $v$ and $\tilde{v}$ denote vectors from the original and training text encoders, respectively.
  • Figure 3: Cosine similarity distributions between $n$ and other vectors. Selected safe vectors initially exhibit positive similarities, which decrease as the unit vector $\frac{n}{\|n\|}$, scaled by $\alpha$, is subtracted.
  • Figure 4: Mechanism of loss adjustment. Visualization of how the loss is adaptively scaled based on the correlation between ${s}_{i}$ and $n$. It assigns a larger loss to vectors dissimilar to $n$ and a smaller loss to those similar to $n$.
  • Figure 5: Qualitative comparison of defense methods in T2I generation. The top row displays results from adversarial prompts, while the bottom row shows results from safe prompts. For benign image generation, words highlighted in red are occasionally omitted by some methods.
  • ...and 14 more figures
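
The target vector generation step described in the captions of Figures 2 and 3 can be sketched as follows. This is a hedged illustration, not the paper's algorithm: it assumes "minimum similarity" means the safe vector with the lowest cosine similarity to the given unsafe vector, and the helper names (`make_target`, `cosine`, `norm`) are hypothetical. The symbols `n` and `alpha` follow the captions.

```python
import math


def norm(a):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in a))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def make_target(unsafe_vec, safe_vecs, n, alpha):
    """Build a target for one unsafe vector.

    Assumption: the paired safe vector is the one least similar to unsafe_vec.
    The target is that safe vector minus alpha * n / ||n||.
    """
    base = min(safe_vecs, key=lambda s: cosine(unsafe_vec, s))
    scale = alpha / norm(n)
    return [b - scale * x for b, x in zip(base, n)]
```

For a fixed safe vector that is not parallel to `n`, the cosine similarity between the resulting target and `n` falls monotonically as `alpha` grows, which matches the trend Figure 3 describes.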