Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
Ibtihel Amara, Ahmed Imtiaz Humayun, Ivana Kajic, Zarana Parekh, Natalie Harris, Sarah Young, Chirag Nagpal, Najoung Kim, Junfeng He, Cristina Nader Vasconcelos, Deepak Ramachandran, Golnoosh Farnadi, Katherine Heller, Mohammad Havaei, Negar Rostamzadeh
TL;DR
The paper investigates the reliability of concept erasure in text-to-image models by introducing EraseBench, a multi-dimensional benchmark that stress-tests erasure across visually similar concepts, artistic styles, subset-superset relationships, and binomial relationships, as well as explicit content. It evaluates five state-of-the-art erasure methods and reveals consistent spillover: non-target concepts lose alignment and quality after erasure, and adversarial methods achieve stronger target forgetting at the cost of greater collateral damage. UCE often preserves non-erased representations better, whereas methods such as Receler, MACE, and AdvUnlearn excel at erasing targets yet degrade non-target concepts and image quality, as shown by CLIP, Gecko, and RAHF metrics and by human judgments. The authors discuss mitigation strategies such as retain and anchor sets but find them insufficient to fully resolve these ripple effects, underscoring the need for more robust evaluation protocols and erasure approaches before real-world deployment.
Abstract
Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships, including visually similar, binomial, and semantically related concepts. To address this gap, we introduce EraseBench, a comprehensive benchmark for evaluating post-erasure performance. EraseBench includes over 100 curated concepts, targeted evaluation prompts, and a robust set of metrics to assess both the effectiveness and the side effects of erasure. Our findings reveal a phenomenon of concept entanglement, in which erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.
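To make the notion of spillover degradation concrete, the following is a minimal sketch of how per-concept score drops could be summarized for one erased target. It is not the EraseBench implementation: the function names, the dictionary layout, and the numeric scores are hypothetical placeholders, and the scores stand in for any alignment or quality metric computed on generations from the original and the erased model.

```python
from statistics import mean


def score_drop(before, after):
    """Per-concept drop in a score (positive value = degradation after erasure)."""
    return {c: before[c] - after[c] for c in before if c in after}


def summarize_erasure(before, after, target):
    """Separate the intended effect (target drop) from spillover on other concepts.

    Assumes at least one non-target concept is present in both score tables.
    """
    drops = score_drop(before, after)
    non_target = {c: d for c, d in drops.items() if c != target}
    return {
        "target_drop": drops[target],                       # erasure effectiveness
        "mean_spillover": mean(non_target.values()),        # average collateral damage
        "worst_hit": max(non_target, key=non_target.get),   # most degraded non-target
    }


# Placeholder scores for illustration only; these are NOT results from the paper.
before = {"cat": 0.30, "tiger": 0.31, "dog": 0.29}
after = {"cat": 0.18, "tiger": 0.25, "dog": 0.28}

summary = summarize_erasure(before, after, target="cat")
print(summary["worst_hit"])  # "tiger" for these placeholder values
```

Under these assumptions, a large `target_drop` together with a small `mean_spillover` would indicate selective erasure, while a large `mean_spillover` would correspond to the concept entanglement described above.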
