Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts

Ibtihel Amara, Ahmed Imtiaz Humayun, Ivana Kajic, Zarana Parekh, Natalie Harris, Sarah Young, Chirag Nagpal, Najoung Kim, Junfeng He, Cristina Nader Vasconcelos, Deepak Ramachandran, Golnoosh Farnadi, Katherine Heller, Mohammad Havaei, Negar Rostamzadeh

TL;DR

The paper investigates the reliability of concept erasure in text-to-image models by introducing EraseBench, a multi-dimensional benchmark that stress-tests erasure across visual similarity, artistic style, subset-superset, and binomial relationships, plus explicit content. It evaluates five state-of-the-art erasure methods and reveals consistent spillover where non-target concepts lose alignment and quality after erasure, with adversarial methods delivering stronger target forgetting but greater collateral damage. UCE often preserves non-erased representations better, while methods like Receler, MACE, and AdvUnlearn excel at erasing targets yet degrade non-target concepts and image quality, as evidenced by CLIP, Gecko, and RAHF metrics and human judgments. The authors discuss mitigation strategies, such as retain and anchor sets, but find them insufficient to fully resolve ripple effects, underscoring the need for more robust evaluation protocols and erasure approaches before real-world deployment.

Abstract

Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships including visually similar, binomial, and semantically related concepts. To address this, we introduce EraseBench, a comprehensive benchmark for evaluating post-erasure performance. EraseBench includes over 100 curated concepts, targeted evaluation prompts, and a robust set of metrics to assess both effectiveness and side effects of erasure. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.

Paper Structure (19 sections, 21 figures, 13 tables)

Figures (21)

  • Figure 1: The effects of concept erasure on non-target concepts. Pre-erasure outputs (left) vs. post-erasure results (right) from Stable Diffusion (SD). Erasure negatively impacts the quality of unrelated concepts. EraseBench identifies such effects and offers a framework to evaluate the reliability of erasure methods.
  • Figure 2: Ripple effects of concept erasure methods across EraseBench entanglement dimensions. All the erasure baselines display failure cases across different EraseBench tasks. Receler and MACE frequently produce images that are unrelated to the text prompt, indicating they are the most sensitive of the five concept erasure techniques. In contrast, AdvUnlearn shows slightly better robustness across certain dimensions of the benchmark. For publication purposes, human faces are left unmasked in outputs that resemble paintings and are masked in more realistic depictions; a black square indicates this masking.
  • Figure 3: Evaluation Dimensions of EraseBench.
  • Figure 4: Erasure affects fine-grained alignment. Images generated for the prompt "A tiger perched on a rocky outcrop surrounded by mountains and a serene blue sky." before (left) and after (right) erasing the concept "cat" using MACE (lu2024mace).
  • Figure 5: Erasure introduces artifacts during subset-superset concept generation. We erase the concept "goldfish" and generate images for the prompt "an image of a guppy". We present the RAHF artifact heatmaps for images generated post-erasure via AdvUnlearn and UCE. The artifacts introduced by each method vary in spatial location and intensity, which motivates our inclusion of the artifact score in EraseBench. More heatmap examples can be found in the supplemental material.
  • ...and 16 more figures