Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Xueming Li, Yue Zhan

TL;DR

This work tackles robust adversarial concept erasure in diffusion models by highlighting that prior adversarial samples do not adequately cover the target concept space due to neglecting semantic structure. It introduces S-GRACE, a semantics-guided framework that uses an LLM to initialize adversarial prompts and a semantics-aware objective to jointly optimize diffusion reconstruction and concept-space alignment, while training only the text encoder to preserve efficiency. Empirical results across NSFW, artistic styles, and object concepts show that S-GRACE improves erasure by at least 26% and reduces training time by ~90% while maintaining non-target concept fidelity, with strong transferability to other diffusion models. The method offers a practical, modular approach to concept unlearning in diffusion models, with potential for safer content generation in deployed systems.
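The joint objective mentioned above can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the mean-squared-error reconstruction term, the cosine-based alignment term, and the `weight` parameter are all illustrative assumptions, and plain Python lists stand in for tensors.

```python
import math

# Hypothetical sketch: every name and the exact form of each term are assumptions,
# not taken from the S-GRACE paper or code.

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def joint_objective(noise_pred, noise_true, adv_embedding, concept_embedding, weight=1.0):
    """Reconstruction term plus a weighted concept-space alignment term (assumed form)."""
    reconstruction = mse(noise_pred, noise_true)
    # Alignment is 0 when the adversarial embedding points along the concept embedding.
    alignment = 1.0 - cosine_similarity(adv_embedding, concept_embedding)
    return reconstruction + weight * alignment
```

In this toy form, an adversarial embedding that already points in the concept direction contributes no alignment penalty, so only the reconstruction error remains. The paper's actual objective may weight or define these terms differently.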

Abstract

Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. Most existing methods follow a recent paradigm in concept erasure: they employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, when there are many, they disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

Paper Structure

This paper contains 23 sections, 4 equations, 9 figures, 16 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of the performance of S-GRACE and several DM-based adversarial concept erasure baselines in removing the nudity concept under SD v1.4.
  • Figure 1: For target concept "nudity", the adversarial samples generated by S-GRACE have the closest distance to the target concept space and the best erasure effect.
  • Figure 2: Visualization of target concept space fitting. S-GRACE generates adversarial samples closer to the true data distribution, demonstrating superior performance in fitting the target concept space.
  • Figure 3: Visualizations of images generated by various methods under the UnlearnDiff [zhang2024ud] attack.
  • Figure 4: S-GRACE disentanglement in concept erasure. It can effectively erase the target concept without affecting the semantic similarity between images and prompts.
  • ...and 4 more figures