Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them
Anh Bui, Trang Vu, Long Vuong, Trung Le, Paul Montague, Tamas Abraham, Junae Kim, Dinh Phung
TL;DR
This work addresses the risk of harmful content in diffusion-based image generation by revisiting concept erasure. It reveals that mapping unwanted concepts to a fixed target is suboptimal due to cross-concept interactions, and demonstrates locality in the concept space using NetFive. The authors introduce Adaptive Guided Erasure (AGE), a minimax framework that automatically selects an optimal target concept for each erasure, further enriching targets as mixtures via a Gumbel-Softmax representation. Across object-related, NSFW, and artistic erasure tasks, AGE achieves superior preservation of benign concepts while effectively erasing undesired ones, outperforming state-of-the-art baselines. These insights advance practical, scalable, and safer diffusion-model deployment by better understanding and exploiting the geometry of concept space.
Abstract
Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models by selectively unlearning undesirable concepts. Previous works typically remove a specific concept by mapping it to a fixed generic target, such as a neutral concept or an empty text prompt. In this paper, we demonstrate that this fixed-target strategy is suboptimal, as it fails to account for the impact of erasing one concept on the others. To address this limitation, we model the concept space as a graph and empirically analyze the effects of erasing one concept on the remaining concepts. Our analysis uncovers intriguing geometric properties of the concept space, where the influence of erasing a concept is confined to a local region. Building on this insight, we propose the Adaptive Guided Erasure (AGE) method, which dynamically selects optimal target concepts tailored to each undesirable concept, minimizing unintended side effects. Experimental results show that AGE significantly outperforms state-of-the-art erasure methods in preserving unrelated concepts while maintaining effective erasure performance. Our code is published at https://github.com/tuananhbui89/Adaptive-Guided-Erasure.
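
The TL;DR notes that AGE represents the target as a mixture of concepts via a Gumbel-Softmax representation. The sketch below illustrates only that generic relaxation, not the authors' implementation: it samples soft selection weights over a few candidate target concepts and mixes their embeddings. The function names, the toy two-dimensional embeddings, the logits, and the temperature are all hypothetical choices made for illustration.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Sample relaxed one-hot weights over candidates (Gumbel-Softmax)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # clamp U away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_targets(weights, embeddings):
    """Return the weighted average of candidate target embeddings."""
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings))
            for i in range(dim)]


# Hypothetical toy data: three candidate target concepts in a 2-D space.
candidate_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
logits = [2.0, 0.5, -1.0]  # assumed learnable selection scores

rng = random.Random(0)
weights = gumbel_softmax(logits, temperature=0.5, rng=rng)
target = mix_targets(weights, candidate_embeddings)
print(weights, target)
```

Lower temperatures push the weights toward a single candidate, while higher temperatures spread them across several. In a gradient-based framework the logits would be trainable parameters, so the choice of target could be optimized rather than fixed in advance.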
