Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

Yongliang Wu; Shiji Zhou; Mingzhuo Yang; Lianzhe Wang; Heng Chang; Wenbo Zhu; Xinting Hu; Xiao Zhou; Xu Yang

TL;DR

This work tackles the generalization gap and utility degradation in concept unlearning for diffusion models trained on large, potentially sensitive datasets. It introduces DoCo, a framework with two components: Concept Domain Correction, which adversarially aligns the output domains of target and anchor concepts, and a Concept Preserving Gradient, which applies gradient surgery to avoid conflicting updates. The method demonstrates superior unlearning performance across instances, styles, and celebrities while preserving related concepts and generalizing to out-of-distribution prompts. The findings suggest practical pathways for responsible diffusion model deployment, with stronger protection against sensitive concepts without sacrificing generative quality.

Abstract

Text-to-image diffusion models have achieved remarkable success in generating photorealistic images. However, the inclusion of sensitive information during pre-training poses significant risks. Machine Unlearning (MU) offers a promising solution to eliminate sensitive concepts from these models. Despite its potential, existing MU methods face two main challenges: 1) limited generalization, where concept erasure is effective only within the unlearned set, failing to prevent sensitive concept generation from out-of-set prompts; and 2) utility degradation, where removing target concepts significantly impacts the model's overall performance. To address these issues, we propose a novel concept domain correction framework named DoCo (Domain Correction). By aligning the output domains of sensitive and anchor concepts through adversarial training, our approach ensures comprehensive unlearning of target concepts. Additionally, we introduce a concept-preserving gradient surgery technique that mitigates conflicting gradient components, thereby preserving the model's utility while unlearning specific concepts. Extensive experiments across various instances, styles, and offensive concepts demonstrate the effectiveness of our method in unlearning targeted concepts with minimal impact on related concepts, outperforming previous approaches even for out-of-distribution prompts.

Paper Structure (20 sections, 10 equations, 4 figures, 3 tables)

Introduction
Related Work
Dataset Filtering
Model Fine-tuning
Post-generation Classification
Method
Diffusion Models
Concept Unlearning Formulation
Concept Domain Correction
Concept Preserving Gradient
Experiments
Implementation Details
Evaluation metrics
Main Results
Removal of Instance.
...and 5 more sections

Figures (4)

Figure 1: When transferred to strongly related out-of-distribution prompts, previous methods fail to unlearn successfully, whereas our method achieves this generalization. Left: unlearning "Van Gogh". Right: unlearning "Dog".
Figure 2: (a) The overall architecture of DoCo, which updates the model parameters through an adversarial training process. This process compels the diffusion model (acting as the generator) to produce denoised results that the discriminator cannot reliably classify as being associated with the target concept, such as "Grumpy Cat", or the anchor concept, such as "Cat". (b) If the unlearning gradient $\mathbf{G}_{u}$ does not conflict with the retraining gradient $\mathbf{G}_{r}$, we update the parameters in the direction of $\mathbf{G}_{u}$. If $\mathbf{G}_{u}$ conflicts with $\mathbf{G}_{r}$, we mitigate the contradictory gradient between them.
Figure 3: Visualization examples of instance unlearning. The prompts for image generation are displayed at the top. Concepts that have been unlearned are indicated in red text on the left side of the images.
Figure 4: Visualization examples of artistic styles unlearning. Left: Unlearning "Van Gogh". Right: Unlearning "Picasso". The first row represents the forgotten style, while the subsequent rows represent other non-target concepts.
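
The rule described in Figure 2(b) can be illustrated with a minimal sketch. The caption does not give the exact formula, so this sketch makes two assumptions: that "conflict" means a negative inner product between $\mathbf{G}_{u}$ and $\mathbf{G}_{r}$, and that the conflict is mitigated by projecting $\mathbf{G}_{u}$ onto the plane orthogonal to $\mathbf{G}_{r}$ (a PCGrad-style projection). The function name `preserve_concept_gradient` and the flattened NumPy vectors are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def preserve_concept_gradient(g_u, g_r):
    """Return an update direction for the unlearning gradient g_u.

    Assumption (not stated in the caption): two gradients "conflict"
    when their inner product is negative, and the conflict is mitigated
    by removing the component of g_u that points against g_r.
    """
    g_u = np.asarray(g_u, dtype=float)
    g_r = np.asarray(g_r, dtype=float)

    dot = float(np.dot(g_u, g_r))
    if dot >= 0.0:
        # No conflict: follow the unlearning gradient unchanged.
        return g_u

    # Conflict: subtract the projection of g_u onto g_r, so the result
    # is orthogonal to g_r. dot < 0 implies g_r is non-zero here.
    return g_u - (dot / float(np.dot(g_r, g_r))) * g_r


# Conflicting case: the component of g_u along g_r is removed.
adjusted = preserve_concept_gradient([1.0, -1.0], [0.0, 1.0])

# Non-conflicting case: g_u is returned as-is.
unchanged = preserve_concept_gradient([1.0, 1.0], [0.0, 1.0])
```

Under these assumptions, the adjusted direction has a zero inner product with $\mathbf{G}_{r}$, so a small step along it does not, to first order, work against the objective that $\mathbf{G}_{r}$ represents.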
