Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them

Anh Bui, Trang Vu, Long Vuong, Trung Le, Paul Montague, Tamas Abraham, Junae Kim, Dinh Phung

TL;DR

This work addresses the risk of harmful content in diffusion-based image generation by revisiting concept erasure. It reveals that mapping unwanted concepts to a fixed target is suboptimal due to cross-concept interactions, and demonstrates locality in the concept space using NetFive. The authors introduce Adaptive Guided Erasure (AGE), a minimax framework that automatically selects an optimal target concept for each erasure, further enriching targets as mixtures via a Gumbel-Softmax representation. Across object-related, NSFW, and artistic erasure tasks, AGE achieves superior preservation of benign concepts while effectively erasing undesired ones, outperforming state-of-the-art baselines. These insights advance practical, scalable, and safer diffusion-model deployment by better understanding and exploiting the geometry of concept space.
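
The mixture-of-targets idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate embeddings, the `logits` vector, the temperature, and the function names `gumbel_softmax` and `mixture_target` are all hypothetical, and the logits are fixed here, whereas AGE would optimize them inside its minimax objective. The sketch shows only how a Gumbel-Softmax sample yields soft weights over candidate target concepts, and how those weights blend the candidates into a single target embedding.

```python
import numpy as np


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample over the candidates."""
    # Gumbel(0, 1) noise via inverse-transform sampling; the lower bound
    # keeps the inner log away from log(0).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    scores = (logits + gumbel) / temperature
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def mixture_target(candidate_embeddings, logits, temperature=1.0, seed=0):
    """Blend candidate target embeddings with Gumbel-Softmax weights.

    candidate_embeddings: (num_candidates, dim) array (hypothetical inputs).
    logits: (num_candidates,) unnormalized preferences (assumed given here).
    """
    rng = np.random.default_rng(seed)
    weights = gumbel_softmax(logits, temperature, rng)
    # Convex combination of the candidate embeddings.
    return weights, weights @ candidate_embeddings


# Toy usage: three hypothetical candidate concepts in a 3-d embedding space.
candidates = np.eye(3)
weights, target = mixture_target(candidates, np.zeros(3), temperature=0.5)
print(weights, target)
```

Lower temperatures push the weights toward a single candidate, while higher temperatures spread them more evenly across candidates.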

Abstract

Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models by selectively unlearning undesirable concepts. Previous works typically remove a specific concept by mapping it to a fixed generic target, such as a neutral concept or an empty text prompt. In this paper, we demonstrate that this fixed-target strategy is suboptimal, as it fails to account for the impact of erasing one concept on the others. To address this limitation, we model the concept space as a graph and empirically analyze the effects of erasing one concept on the remaining concepts. Our analysis uncovers intriguing geometric properties of the concept space, where the influence of erasing a concept is confined to a local region. Building on this insight, we propose the Adaptive Guided Erasure (AGE) method, which dynamically selects optimal target concepts tailored to each undesirable concept, minimizing unintended side effects. Experimental results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance. Our code is published at https://github.com/tuananhbui89/Adaptive-Guided-Erasure.

Paper Structure

This paper contains 50 sections, 6 equations, 25 figures, 9 tables, 1 algorithm.

Figures (25)

  • Figure 1: Analysis of the impact of choosing the empty concept as the target concept for erasure. Complete details are provided in Figure \ref{fig:netfive_impact_empty_full_ds5}.
  • Figure 2: Analysis of the impact of choosing a specific concept as the target concept for erasure. Full details are provided in Figure \ref{fig:netfive_impact_specific_ds5}.
  • Figure 3: Number of exposed body parts counted in all generated images with threshold 0.5.
  • Figure 4: Top/a: Intermediate results of the search process, with images generated from the most sensitive concepts $c_t$ found by our method and $c_e$ at the same optimization step. Bottom/b: Similarity between nudity attributes and keywords.
  • Figure 5: CLIP and LPIPS scores for the artistic style erasure task. ($^\ast$) LPIPS on the x-axis is scaled by 34 for better visualization. Full results are in Table \ref{tab:artistic_style_erasing}.
  • ...and 20 more figures