Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

Saemi Moon, Minjong Lee, Sangdon Park, Dongwoo Kim

TL;DR

This work introduces the Holistic Unlearning Benchmark (HUB), a framework for evaluating unlearning methods for text-to-image diffusion models along six dimensions (faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, and efficiency). It covers 33 target concepts in four categories (Celebrity, Style, IP, NSFW), with 16,000 prompts per concept. HUB pairs a large-scale prompt-generation pipeline with VLM-based concept detection to assess how well an unlearning method removes a target concept, preserves unrelated content, carries over to non-English prompts, and withstands adversarial prompts. Across seven baseline methods, no method dominates all metrics; the results highlight tradeoffs between removing unwanted content and maintaining generation quality and alignment, particularly for NSFW concepts. The data and evaluation code are released to provide a standardized, multi-faceted benchmark intended to spur development of more reliable and robust unlearning techniques, with practical implications for safety. The study also argues that holistic, cross-language, and attack-sensitive evaluation is needed to check that unlearning generalizes beyond English prompts and narrow test sets.

Abstract

As text-to-image diffusion models gain widespread commercial applications, there are increasing concerns about unethical or harmful use, including the unauthorized generation of copyrighted or sensitive content. Concept unlearning has emerged as a promising solution to these challenges by removing undesired and harmful information from the pre-trained model. However, previous evaluations primarily focus on whether target concepts are removed while image quality is preserved, neglecting broader impacts such as unintended side effects. In this work, we propose the Holistic Unlearning Benchmark (HUB), a comprehensive framework for evaluating unlearning methods across six key dimensions: faithfulness, alignment, pinpoint-ness, multilingual robustness, attack robustness, and efficiency. Our benchmark covers 33 target concepts, with 16,000 prompts per concept, spanning four categories: Celebrity, Style, Intellectual Property, and NSFW. Our investigation reveals that no single method excels across all evaluation criteria. By releasing our evaluation code and dataset, we hope to inspire further research in this area, leading to more reliable and effective unlearning methods.

Paper Structure

This paper contains 53 sections, 5 equations, 13 figures, 36 tables.

Figures (13)

  • Figure 1: Holistic Unlearning Benchmark. HUB systematically evaluates unlearning methods across six key aspects, covering 33 target concepts grouped into four categories: Celebrity, Style, IP, and NSFW. HUB provides an extensive set of 16,000 prompts per concept.
  • Figure 2: Example images generated with prompt "a photo of banana" from the models where 'Pikachu' is removed. All images are generated from the same seed.
  • Figure 3: Response from the VLM-based concept detection framework, illustrating cases categorized as "Yes" for $\mathtt{IP}$.
  • Figure 4: Response from the VLM-based concept detection framework, illustrating cases categorized as "No" for $\mathtt{IP}$.
  • Figure 5: Response from the VLM-based concept detection framework, illustrating cases categorized as "Yes" for $\mathtt{Style}$.
  • ...and 8 more figures
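
Figures 3 to 5 show the VLM-based detector answering "Yes" or "No" for a generated image. As a minimal sketch of how such answers could be tallied into a per-concept detection rate, the Python snippet below counts the fraction of "Yes" replies for each concept. The input format, the function name `detection_rates`, the rule for parsing a reply, and the placeholder concept `some_style` are all assumptions made for illustration; none of them is taken from the HUB code release, and the paper's actual metrics may be defined differently.

```python
from collections import defaultdict


def detection_rates(responses):
    """Return the fraction of "Yes" answers per concept.

    `responses` is an iterable of (concept, answer) pairs, where `answer`
    is the detector's raw text reply. This input format is hypothetical.
    """
    yes_counts = defaultdict(int)
    totals = defaultdict(int)
    for concept, answer in responses:
        totals[concept] += 1
        # Assumption: a reply starting with "yes" (any case) counts as a detection.
        if answer.strip().lower().startswith("yes"):
            yes_counts[concept] += 1
    return {concept: yes_counts[concept] / totals[concept] for concept in totals}


# Toy replies for illustration only; these are not results from the paper.
sample = [
    ("Pikachu", "Yes"),
    ("Pikachu", "No"),
    ("Pikachu", "yes, the character is visible"),
    ("some_style", "No"),
]
rates = detection_rates(sample)
print(rates)
```

Under this reading, a lower rate on prompts that name the target concept would suggest more complete removal, while the same tally on unrelated prompts could serve as a check on over-erasure.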