AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors

Mengnan Zhao; Lihe Zhang; Xingyi Yang; Tianhang Zheng; Baocai Yin

TL;DR

This work addresses safety-oriented unlearning in text-guided diffusion models, where the central trade-off is erasing undesired concepts without degrading overall generation quality. It first analyzes how different anchor designs influence unlearning performance and derives a design principle: anchors should be semantically similar to the undesired concept yet omit its defining attributes. Building on this, it introduces AdvAnchor, which creates adversarial anchors by adding a universal perturbation $\bm{e}_\text{adv}$ to the embedding of the undesired concept. Two loss terms, $L_{\text{adv1}}$ and $L_{\text{adv2}}$, are optimized to degrade generation of the undesired concept while preserving retained concepts, and an alternating optimization scheme updates $\bm{e}_\text{adv}$ and the unlearning weights $\bm{\theta}_\text{op}$ in turn. Experiments on Stable Diffusion show that AdvAnchor outperforms state-of-the-art methods on style and object unlearning tasks, including explicit content removal, achieving stronger erasure with comparable or better preservation metrics. Because adversarial anchors can be integrated with existing unlearning strategies, the approach offers a practical route to targeted concept erasure with minimal collateral impact.
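The alternating update described in the summary can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "denoiser" is an elementwise linear map, the adversarial objective simply maximizes the change in the frozen model's prediction under a box constraint, and the function names (`predict`, `adversarial_step`, `unlearning_step`, `alternate`) are hypothetical. The paper's actual losses $L_{\text{adv1}}$ and $L_{\text{adv2}}$ are not reproduced.

```python
def predict(theta, e):
    """Toy stand-in for a denoiser: elementwise linear map of the embedding."""
    return [t * x for t, x in zip(theta, e)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def clip(v, eps):
    """Keep every coordinate of the perturbation inside [-eps, eps]."""
    return [max(-eps, min(eps, x)) for x in v]

def adversarial_step(theta_frozen, e_adv, eps, lr):
    # Assumed objective: maximize the change in the frozen model's prediction,
    # which for this toy model equals sum_i (theta0_i * e_adv_i)^2.
    grad = [2 * t * t * d for t, d in zip(theta_frozen, e_adv)]
    return clip([d + lr * g for d, g in zip(e_adv, grad)], eps)

def unlearning_step(theta, theta_frozen, e_ori, e_anchor, lr):
    # Align the fine-tuned prediction on e_ori with the frozen prediction on the anchor.
    target = predict(theta_frozen, e_anchor)
    pred = predict(theta, e_ori)
    grad = [2 * (p - y) * x for p, y, x in zip(pred, target, e_ori)]
    return [t - lr * g for t, g in zip(theta, grad)]

def alternate(theta0, e_ori, eps=0.1, rounds=200, lr_adv=0.05, lr_theta=0.1):
    theta = list(theta0)
    e_adv = [eps * 0.01] * len(e_ori)  # small non-zero start so the gradient is non-zero
    for _ in range(rounds):
        e_adv = adversarial_step(theta0, e_adv, eps, lr_adv)        # update the perturbation
        e_anchor = [x + d for x, d in zip(e_ori, e_adv)]            # anchor = embedding + perturbation
        theta = unlearning_step(theta, theta0, e_ori, e_anchor, lr_theta)  # update the weights
    return theta, e_adv

theta0 = [1.0, 2.0]
e_ori = [1.0, -1.0]
theta, e_adv = alternate(theta0, e_ori)
print("perturbation:", e_adv)
print("fine-tuned weights:", theta)
```

In this toy setting the perturbation stays within its bound while the fine-tuned weights end up matching the frozen model's output on the anchor rather than on the original embedding, which is the general shape of the alignment the summary describes.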

Abstract

Security concerns surrounding text-to-image diffusion models have driven researchers to unlearn inappropriate concepts through fine-tuning. Recent fine-tuning methods typically align the prediction distributions of unsafe prompts with those of predefined text anchors. However, these techniques exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts. In this paper, we systematically analyze the impact of diverse text anchors on unlearning performance. Guided by this analysis, we propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue. These adversarial anchors are crafted to closely resemble the embeddings of undesirable concepts to maintain overall model performance, while selectively excluding defining attributes of these concepts for effective erasure. Extensive experiments demonstrate that AdvAnchor outperforms state-of-the-art methods. Our code is publicly available at https://anonymous.4open.science/r/AdvAnchor.
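As a rough illustration of what "closely resemble the embeddings of undesirable concepts" can mean numerically, the sketch below bounds a perturbation by an L2 radius and measures how similar the resulting anchor is to the original embedding. The L2 bound, the cosine-similarity measure, and the helper names (`project_l2`, `make_anchor`, `cosine`) are assumptions for illustration; the source does not state which norm or similarity measure the paper uses.

```python
import math

def project_l2(v, radius):
    """Scale v back onto the L2 ball of the given radius if it lies outside."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= radius or norm == 0.0:
        return list(v)
    return [x * radius / norm for x in v]

def make_anchor(e_ori, e_adv, radius):
    """Hypothetical anchor: original embedding plus a norm-bounded perturbation."""
    bounded = project_l2(e_adv, radius)
    return [x + d for x, d in zip(e_ori, bounded)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

e_ori = [3.0, 4.0]                      # toy 2-D "concept embedding"
anchor = make_anchor(e_ori, [10.0, 0.0], radius=0.5)
print("anchor:", anchor)
print("cosine similarity to original:", cosine(e_ori, anchor))
```

A small radius keeps the anchor near the original embedding; which directions the perturbation should take to remove a concept's defining attributes is what the adversarial optimization in the paper is meant to decide.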

Paper Structure (13 sections, 7 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Related work
Proposed method
    Impact of various anchors on DM unlearning
        Anchors
        Settings
        Observations
    Proposed AdvAnchor
Experiments
    Experimental Details
    Unlearning Evaluation
    Ablation Studies
Conclusions

Figures (8)

Figure 1: Overview of the proposed AdvAnchor. To construct adversarial anchors, tiny adversarial perturbations that greatly affect the generation performance of DMs on 'Van Gogh' are added to the embeddings of 'Van Gogh'. $\bm{\theta}_\text{op}$ is fine-tuned by aligning the prediction distributions of $\bm{e}_\text{ori}$ with those of $\bm{e}_\text{anchor}$.
Figure 2: The denoising process of text-guided DMs. a) the encoder converts the input noise into latent representations $\bm{x}_\text{T}$; b) the denoiser iteratively removes the predicted noise $\bm{\epsilon}_{t\in[1,\text{N}]}$ from latent representations $\bm{x}_t^c$; c) the decoder reconstructs the image from the denoised representations $\bm{x}_0^c$.
Figure 3: Impact of using various types of words as anchors ($p_\text{anchor}^\text{word}$) on DM unlearning.
Figure 4: Ablation studies on the length of the shared sentence between $p_\text{anchor}$ and $p_\text{u}$ in DM unlearning.
Figure 5: Comparative experiments using $p_\text{anchor}^\text{word}$ and $p_\text{anchor}^\text{desc}$.
...and 3 more figures
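The three stages named in the Figure 2 caption (encode, iteratively denoise, decode) can be laid out as a toy loop. This is only a structural sketch under loud assumptions: `encode` and `decode` are identity placeholders, `predict_noise` is a hypothetical rule that treats the gap to a condition-dependent target as the noise, and the `1 / t` step size is chosen for the toy rather than taken from any real sampler.

```python
def encode(noise):
    """Placeholder encoder: returns the input noise as the starting latent x_T."""
    return list(noise)

def decode(latent):
    """Placeholder decoder: returns the final latent x_0 unchanged."""
    return list(latent)

def predict_noise(x, t, cond):
    # Toy denoiser: the "noise" is whatever separates x from a target set by the condition.
    # The timestep t is accepted to mirror the usual interface but is unused here.
    return [xi - ci for xi, ci in zip(x, cond)]

def denoise(noise, cond, steps=10):
    x = encode(noise)                                # a) input noise -> latent x_T
    for t in range(steps, 0, -1):                    # b) iterate from t = steps down to 1
        eps = predict_noise(x, t, cond)
        x = [xi - e / t for xi, e in zip(x, eps)]    # remove a 1/t share of the predicted noise
    return decode(x)                                 # c) latent x_0 -> output

print(denoise([5.0, -3.0], cond=[1.0, 2.0]))
```

The point of the sketch is only the control flow: the text condition enters through the noise prediction at every step, which is why changing how the model responds to a concept embedding changes what is generated.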
