Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
Saemi Moon, Minjong Lee, Sangdon Park, Dongwoo Kim
TL;DR
This work introduces the Holistic Unlearning Benchmark (HUB), a comprehensive framework for evaluating unlearning methods for text-to-image diffusion models along six dimensions (faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, efficiency) on 33 concepts across four categories (Celebrity, Style, IP, NSFW), with 16,000 prompts per concept. HUB pairs a large-scale prompt-generation pipeline with VLM-based concept detection to assess how well unlearning methods remove target concepts while preserving unrelated content, and whether that removal holds up under non-English prompts and adversarial prompts. Across seven baseline methods, HUB shows that no method dominates on all metrics, highlighting crucial tradeoffs between removing unwanted content and maintaining generation quality and alignment, particularly for NSFW concepts. By releasing its data and evaluation code, HUB provides a standardized, multi-faceted benchmark intended to spur development of more reliable and robust unlearning techniques, with practical implications for safety. The study also demonstrates the importance of holistic, cross-language, and attack-aware evaluation in checking that unlearning generalizes beyond English prompts and narrow test sets.
Abstract
As text-to-image diffusion models see widespread commercial use, there are increasing concerns about unethical or harmful use, including the unauthorized generation of copyrighted or sensitive content. Concept unlearning has emerged as a promising solution to these challenges by removing undesired and harmful information from the pre-trained model. However, previous evaluations primarily focus on whether target concepts are removed while image quality is preserved, neglecting broader impacts such as unintended side effects. In this work, we propose the Holistic Unlearning Benchmark (HUB), a comprehensive framework for evaluating unlearning methods across six key dimensions: faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, and efficiency. Our benchmark covers 33 target concepts, with 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW. Our investigation reveals that no single method excels across all evaluation criteria. By releasing our evaluation code and dataset, we hope to inspire further research in this area, leading to more reliable and effective unlearning methods.
