Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models

Die Chen, Zhiwen Li, Cen Chen, Yuexiang Xie, Xiaodan Li, Jinyan Ye, Yingda Chen, Yaliang Li

TL;DR

This work addresses NSFW content generation in text-to-image diffusion models by delivering a full-pipeline benchmark for NSFW concept erasure. It introduces dataset enrichment, a taxonomy of erasure methods, automated NSFW detectors, and a multi-metric evaluation framework, then systematically evaluates 13 methods on four axes: erasure effectiveness, data sensitivity, robustness to adversarial prompts, and preservation of unrelated concepts. Key findings: post-hoc approaches are efficient but vulnerable to evasion; adversarial fine-tuning improves robustness but often reduces generation quality; and increasing data alone offers limited gains, especially for abstract themes such as horror. The benchmark offers practical guidance and a foundation for future, broader evaluations of diffusion-based content generation.

Abstract

Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.

Paper Structure

This paper contains 28 sections, 3 equations, 6 figures, and 19 tables.

Figures (6)

  • Figure 1: Our benchmark framework is built around a full-pipeline toolkit specifically designed to investigate concept erasure from four key evaluation perspectives.
  • Figure 2: Information of four NSFW-related datasets (left) and the toxicity relationship (right).
  • Figure 3: The Erasure proportion (EP $\uparrow$) of three themes in two modes. A larger method coverage area indicates better performance.
  • Figure 4: The descriptions for three themes and image examples.
  • Figure 5: Qualitative examples of different methods, targeting different themes.
  • ...and 1 more figure