Buster: Implanting Semantic Backdoor into Text Encoder to Mitigate NSFW Content Generation

Xin Zhao, Xiaojun Chen, Yuexin Xuan, Zhendong Zhao, Xiaojun Jia, Xinfeng Li, Xiaofeng Wang

TL;DR

Buster mitigates NSFW content generation in Text-to-Image (T2I) models by implanting semantic backdoors into the text encoder. It combines energy-based data augmentation via Langevin dynamics with a teacher-guided poisoning framework that aligns adversarial prompts with a safe target while preserving benign outputs; only the text encoder is updated, which keeps fine-tuning efficient. The approach achieves an NSFW removal rate of at least 91.2% while maintaining image quality, and it generalizes across NSFW categories, resists adaptive attacks, and scales across diffusion-model variants. The result is a practical defense for open T2I systems with minimal impact on benign content.

Abstract

The rise of deep learning models in the digital era has raised substantial concerns regarding the generation of Not-Safe-for-Work (NSFW) content. Existing defense methods primarily involve model fine-tuning and post-hoc content moderation. Nevertheless, these approaches largely lack scalability in eliminating harmful content, degrade the quality of benign image generation, or incur high inference costs. To address these challenges, we propose an innovative framework named Buster, which injects backdoors into the text encoder to prevent NSFW content generation. Buster uses deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts. Additionally, Buster employs energy-based training data generation through Langevin dynamics for adversarial knowledge augmentation, thereby ensuring robustness in harmful concept definition. This approach demonstrates exceptional resilience and scalability in mitigating NSFW content. In particular, Buster fine-tunes the text encoder of Text-to-Image models within merely five minutes, showcasing its efficiency. Our extensive experiments demonstrate that Buster outperforms nine state-of-the-art baselines, achieving a superior NSFW content removal rate of at least 91.2% while preserving the quality of harmless images.
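
To make the phrase "energy-based training data generation through Langevin dynamics" more concrete, the following is a minimal sketch of the generic unadjusted Langevin update, x <- x - (step/2) * grad E(x) + sqrt(step) * noise. The energy function here is a toy quadratic chosen purely for illustration; the abstract does not specify Buster's actual energy function or the space in which sampling takes place, so `energy_grad`, `langevin_sample`, and all parameter values are assumptions, not the authors' implementation.

```python
import math
import random


def energy_grad(x, center):
    """Gradient of the toy energy E(x) = 0.5 * ||x - center||^2 (an assumption)."""
    return [xi - ci for xi, ci in zip(x, center)]


def langevin_sample(x0, center, step_size=0.1, n_steps=200, rng=None):
    """Run unadjusted Langevin dynamics and return the final point.

    Each step moves downhill on the energy and adds Gaussian noise, so the
    chain drifts toward low-energy regions while still exploring around them.
    """
    rng = rng or random.Random(0)
    x = list(x0)
    noise_scale = math.sqrt(step_size)
    for _ in range(n_steps):
        grad = energy_grad(x, center)
        x = [
            xi - 0.5 * step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, grad)
        ]
    return x


if __name__ == "__main__":
    # Start far from the low-energy region and let the chain move toward it.
    print(langevin_sample([5.0, -5.0], center=[1.0, 2.0]))
```

In an augmentation setting, points drawn this way would serve as additional samples concentrated in the low-energy region, which is presumably how the method enlarges the adversarial training set; the details of that mapping are not stated in this section.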

Paper Structure

This paper contains 32 sections, 10 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Possible defense mechanisms deployed by T2I models. ① N/A: (a) no defense. ② Post-hoc Content Moderation: (b) text-based, and (c) image-based. ③ Model Fine-tuning: (d) fine-tuned U-Net, and (e) poisoned text encoder (ours).
  • Figure 2: Pipeline of T2I architecture.
  • Figure 3: The framework of our proposed Buster. The semantic-oriented data augmentation module augments the adversarial dataset. During training, a pre-trained clean text encoder serves as a teacher model that guides the poisoned text encoder. Adversarial prompts are processed by the poisoned text encoder and aligned with the target prompt embeddings generated by the clean text encoder. Benign prompts are fed into both encoders to keep their outputs consistent. During sampling, benign prompts given to the poisoned T2I model produce normal images, whereas prompts containing NSFW content cause the poisoned T2I model to generate the target images instead.
  • Figure 4: Visualization of data distribution for benign and adversarial prompts.
  • Figure 5: Nude and benign images generated by Buster as well as other methods.
  • ...and 4 more figures
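
The teacher-guided training described in the Figure 3 caption can be sketched as a two-term objective: one term pulls the poisoned (student) encoder's embeddings of adversarial prompts toward the clean (teacher) encoder's embedding of the target prompt, and the other keeps the two encoders in agreement on benign prompts. The sketch below is illustrative only. The use of mean squared error, the `benign_weight` parameter, the toy encoders, and the placeholder prompts are all assumptions; this section does not give the paper's actual loss or target prompt.

```python
def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def teacher_guided_loss(student, teacher, adversarial_prompts, benign_prompts,
                        target_prompt, benign_weight=1.0):
    """Backdoor term plus benign-consistency term (assumed MSE form)."""
    target_emb = teacher(target_prompt)
    # Backdoor term: adversarial prompts should map to the target embedding.
    backdoor = sum(mse(student(p), target_emb) for p in adversarial_prompts)
    backdoor /= len(adversarial_prompts)
    # Consistency term: benign prompts should keep their clean embeddings.
    utility = sum(mse(student(p), teacher(p)) for p in benign_prompts)
    utility /= len(benign_prompts)
    return backdoor + benign_weight * utility


# Hypothetical stand-ins for real text encoders and data.
ADVERSARIAL = ["unsafe prompt A", "unsafe prompt B"]  # placeholders
BENIGN = ["a red bicycle", "a bowl of fruit"]
TARGET = "a photo of a cat"  # hypothetical target prompt


def toy_teacher(prompt):
    """Toy 'clean encoder': maps a string to a 2-d vector."""
    return [float(len(prompt)), float(prompt.count("a"))]


def ideal_student(prompt):
    """Toy 'poisoned encoder' that already satisfies both terms exactly."""
    if prompt in ADVERSARIAL:
        return toy_teacher(TARGET)
    return toy_teacher(prompt)


clean_loss = teacher_guided_loss(toy_teacher, toy_teacher, ADVERSARIAL, BENIGN, TARGET)
ideal_loss = teacher_guided_loss(ideal_student, toy_teacher, ADVERSARIAL, BENIGN, TARGET)
```

Here an unmodified encoder (`toy_teacher` used as its own student) incurs a positive loss because its adversarial embeddings differ from the target embedding, while `ideal_student` drives both terms to zero. In a real setting the student would be a trainable copy of the text encoder and the loss would be minimized by gradient descent, with the teacher kept frozen.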