Comprehensive Assessment and Analysis for NSFW Content Erasure in Text-to-Image Diffusion Models

Die Chen, Zhiwen Li, Cen Chen, Xiaodan Li, Jinyan Ye

TL;DR

This work introduces the first systematic benchmark for NSFW concept erasure in text-to-image diffusion models, evaluating 11 methods (14 variants) across six perspectives (erasure proportion, excessive erasure, explicit/implicit prompts, image quality, semantic alignment, robustness) and two data modes (Mode 1 text-only, Mode 2 image-based). It combines task-level assessments with tool- and insight-level analyses, including toxicity studies of prompts, classifier comparisons, and a Genital Ratio Difference metric to quantify excessive erasure. Key findings show no method universally dominates; post-hoc approaches like SLD-Max excel at erasure but can hurt image quality and alignment, while methods like SLD-Str and UCE offer more balanced, robust performance. The authors provide practical recommendations, discuss limitations (notably compatibility with newer SD versions), and offer an open-source benchmark framework to guide future safety research in NSFW content erasure for diffusion models.

Abstract

Text-to-image (T2I) diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of these models can inadvertently lead them to generate NSFW content despite efforts to filter such content from the training dataset, posing risks to their safe deployment. While several concept erasure methods have been proposed to mitigate this issue, a comprehensive evaluation of their effectiveness remains absent. To bridge this gap, we present the first systematic investigation of concept erasure methods for NSFW content and its sub-themes in text-to-image diffusion models. At the task level, we provide a holistic evaluation of 11 state-of-the-art baseline methods with 14 variants. Specifically, we analyze these methods from six distinct assessment perspectives, including three conventional perspectives, i.e., erasure proportion, image quality, and semantic alignment, and three new perspectives, i.e., excessive erasure, the impact of explicit and implicit unsafe prompts, and robustness. At the tool level, we perform a detailed toxicity analysis of NSFW datasets and compare the performance of different NSFW classifiers, offering deeper insights into their performance alongside a compilation of comprehensive evaluation metrics. Our benchmark not only systematically evaluates concept erasure methods, but also delves into the underlying factors influencing their performance at the insight level. By synthesizing insights from various evaluation perspectives, we provide a deeper understanding of the challenges and opportunities in the field, offering actionable guidance and inspiration for advancing research and practical applications in concept erasure.

Paper Structure

This paper contains 32 sections, 6 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: NSFW is divided into five themes. We describe each theme and include image examples for a more concrete illustration. Since the erasure methods use keyword sets as erasure targets, we also present the complete keyword set for the "more keywords" version and for a more generalized "less keywords" version.
  • Figure 2: Our benchmark framework consists of three parts: assessment tools, assessment targets, and assessment content. In terms of assessment tools, we conduct toxicity analysis on the NSFW dataset and compare the accuracy of classifiers. These tools are used in assessment experiments for concept erasure methods, which are divided into two modes. To analyze the specific data requirements of each method, we differentiate between different versions of the methods. For assessment content, we categorize specific themes under NSFW and perform the analysis from six different perspectives.
  • Figure 3: Erasure scores ($\uparrow$) of different methods on five themes in two modes. Different versions of the method generate corresponding images for four NSFW datasets, and after classification using VQA, the erasure scores for each theme are calculated. A larger method coverage area indicates better performance.
  • Figure 4: Erasure scores ($\uparrow$) of different methods on the sexually explicit theme, obtained using the NudeNet classifier for body part recognition. Negative scores mean the result after erasing is worse. For Mode 1 methods, we selected the "more keywords" version, and for Mode 2 methods, the 200-image version. A larger erasure score indicates better erasure performance.
  • Figure 5: Different versions of different methods generate images using the COCO-10k dataset. We use FID and LPIPS to calculate image quality, and CLIP score and Image Reward to calculate semantic alignment. For image quality, smaller values of the metrics are better, while for semantic alignment, larger values are better.
  • ...and 1 more figure
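
The captions for Figures 3 and 4 describe an erasure score where larger is better and negative values mean the erased model performs worse. The exact formula is not given in this excerpt, so the following is only a minimal sketch under one plausible assumption: the score is the relative reduction in images flagged unsafe by a classifier, compared with the unmodified model on the same prompts. The function name `erasure_score` and the boolean-flag input format are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def erasure_score(baseline_flags: Sequence[bool], erased_flags: Sequence[bool]) -> float:
    """Relative reduction in unsafe generations (assumed definition, not from the paper).

    baseline_flags[i] is True if the image generated by the original model for
    prompt i was classified as unsafe; erased_flags[i] is the same for the model
    after concept erasure. Both sequences must cover the same prompts.
    """
    if len(baseline_flags) != len(erased_flags):
        raise ValueError("both models must be evaluated on the same prompts")

    baseline_unsafe = sum(baseline_flags)
    if baseline_unsafe == 0:
        raise ValueError("the baseline produced no unsafe images; score is undefined")

    erased_unsafe = sum(erased_flags)
    # Positive: fewer unsafe images after erasure. Negative: more unsafe images.
    return (baseline_unsafe - erased_unsafe) / baseline_unsafe


if __name__ == "__main__":
    # Four of five baseline images are unsafe; one remains unsafe after erasure.
    print(erasure_score([True, True, True, True, False],
                        [True, False, False, False, False]))
    # The erased model produces more unsafe images than the baseline.
    print(erasure_score([True, False, False, False],
                        [True, True, False, False]))
```

Under this assumed definition, a score of 1.0 would mean that no unsafe images remain, 0.0 would mean no change, and a negative score would correspond to the case the Figure 4 caption describes, where the result after erasing is worse.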