Rethinking Robust Adversarial Concept Erasure in Diffusion Models
Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Xueming Li, Yue Zhan
TL;DR
This work tackles robust adversarial concept erasure in diffusion models by highlighting that prior adversarial samples do not adequately cover the target concept space due to neglecting semantic structure. It introduces S-GRACE, a semantics-guided framework that uses an LLM to initialize adversarial prompts and a semantics-aware objective to jointly optimize diffusion reconstruction and concept-space alignment, while training only the text encoder to preserve efficiency. Empirical results across NSFW, artistic styles, and object concepts show that S-GRACE improves erasure by at least 26% and reduces training time by ~90% while maintaining non-target concept fidelity, with strong transferability to other diffusion models. The method offers a practical, modular approach to concept unlearning in diffusion models, with potential for safer content generation in deployed systems.
Abstract
Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. Following a recent paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.
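
The TL;DR describes a semantics-aware objective that jointly optimizes diffusion reconstruction and concept-space alignment. The abstract does not give the actual loss, so the following is only a toy sketch of how two such terms might be combined into one scalar. It assumes a mean-squared-error reconstruction term, a cosine-distance alignment term, and a weighting factor `lam`; the function names, the choice of cosine distance, and the weighting scheme are all hypothetical and are not taken from the S-GRACE code.

```python
import math

def mse(pred, target):
    # Mean squared error between predicted and true noise (toy stand-in
    # for a diffusion reconstruction term).
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_similarity(a, b):
    # Cosine similarity between two non-zero embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def joint_objective(noise_pred, noise_true, prompt_emb, concept_center, lam=0.5):
    # Hypothetical joint loss: reconstruction term plus a weighted
    # alignment term that shrinks as the adversarial prompt embedding
    # points in the same direction as an assumed concept-space center.
    recon = mse(noise_pred, noise_true)
    align = 1.0 - cosine_similarity(prompt_emb, concept_center)
    return recon + lam * align
```

In this sketch, a perfectly reconstructed sample whose prompt embedding is parallel to the concept center yields a loss of zero, while an orthogonal embedding adds a penalty of `lam`. In a real setup the gradient of such a loss would be applied only to the text encoder, as the TL;DR states, but that training loop is outside the scope of this illustration.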
