Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective

Xiaoxuan Han; Songlin Yang; Wei Wang; Yang Li; Jing Dong

TL;DR

This work addresses the safety and privacy risks of text-to-image diffusion models by scrutinizing the trustworthiness of concept erasure methods. It introduces a transferable adversarial search strategy that locates embeddings capable of restoring erased concepts across unknown unlearning methods, using the original Stable Diffusion as a surrogate. The approach is formulated as a constrained Min-Max optimization and relies on alternating updates to embeddings and model parameters to navigate low-density regions in embedding space where erasure methods are less effective. Experiments demonstrate cross-method transferability and effectiveness across object, artist style, NSFW, and identity concepts, highlighting a practical path to assess and improve erasure robustness in diffusion models.
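The constrained Min-Max formulation mentioned above can be written schematically as follows. This is only a sketch under stated assumptions: the symbols $\theta$ (surrogate model parameters, initialized from the original Stable Diffusion), $e$ (the searched embedding), and the feasible sets $\mathcal{E}$ and $\Theta$ are placeholders introduced here, and the loss shown is the standard diffusion denoising objective rather than the paper's exact equations, which may differ.

```latex
\min_{e \in \mathcal{E}} \; \max_{\theta \in \Theta} \; \mathcal{L}(\theta, e),
\qquad
\mathcal{L}(\theta, e) =
\mathbb{E}_{x,\, t,\, \epsilon}
\left[ \left\lVert \epsilon - \epsilon_{\theta}(x_t, t, e) \right\rVert_2^2 \right]
```

Under this reading, the inner maximization plays the role of erasure (the surrogate parameters are pushed so that the current embedding no longer reproduces the target concept $x$), while the outer minimization searches for an embedding that still restores the concept. Alternating the two updates corresponds to the erase-and-search loop described in the summary.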

Abstract

Advanced text-to-image diffusion models raise safety concerns regarding identity privacy violation, copyright infringement, and Not Safe For Work (NSFW) content generation. To address these concerns, unlearning methods have been developed to erase the concepts involved from diffusion models. However, these unlearning methods only shift the text-to-image mapping and preserve the visual content within the generative space of diffusion models, leaving a fatal flaw that allows the erased concepts to be restored. This erasure trustworthiness problem needs to be probed, but previous methods are sub-optimal in two respects: (1) Lack of transferability: some methods operate in a white-box setting, requiring access to the unlearned model, and the learned adversarial input often fails to transfer to other unlearned models for concept restoration; (2) Limited attack: prompt-level methods struggle to restore narrow concepts, such as celebrity identity, from unlearned models. Therefore, this paper leverages the transferability of adversarial attacks to probe unlearning robustness under a black-box setting. This challenging scenario assumes that the unlearning method is unknown and the unlearned model is inaccessible for optimization, so the attack must transfer across different unlearned models. Specifically, we employ an adversarial search strategy to find an adversarial embedding that transfers across different unlearned models. This strategy adopts the original Stable Diffusion model as a surrogate model to iteratively erase and search for embeddings, enabling it to find an embedding that can restore the target concept under different unlearning methods. Extensive experiments demonstrate the transferability of the searched adversarial embedding across several state-of-the-art unlearning methods and its effectiveness for different levels of concepts.
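The alternating "erase, then search" loop described in the abstract can be illustrated with a toy example. The sketch below is not the paper's algorithm: it replaces Stable Diffusion with a hypothetical 2x2 linear surrogate `W`, the target concept with a fixed vector `y`, and the denoising loss with a squared error. The function names, step sizes, and the Frobenius-ball constraint on the surrogate are all assumptions made for illustration only.

```python
def residual(W, e, y):
    """Return W @ e - y for a small matrix W stored as nested lists."""
    return [sum(w * x for w, x in zip(row, e)) - t for row, t in zip(W, y)]


def loss(W, e, y):
    """Squared error standing in for the denoising loss (toy assumption)."""
    return sum(r * r for r in residual(W, e, y))


def project(W, W0, radius):
    """Keep W inside a Frobenius ball of the given radius around W0."""
    diff = [[w - w0 for w, w0 in zip(rw, r0)] for rw, r0 in zip(W, W0)]
    norm = sum(d * d for row in diff for d in row) ** 0.5
    scale = 1.0 if norm <= radius else radius / norm
    return [[w0 + scale * d for w0, d in zip(r0, rd)]
            for r0, rd in zip(W0, diff)]


def adversarial_search(W0, y, e0, rounds=10, erase_steps=10, search_steps=50,
                       lr_w=0.1, lr_e=0.2, radius=0.5):
    W = [row[:] for row in W0]  # surrogate starts as a copy of the original
    e = list(e0)
    history = []
    for _round in range(rounds):
        # Erase step: normalized gradient ascent on the surrogate weights,
        # so the current embedding stops reproducing the target.
        for _step in range(erase_steps):
            r = residual(W, e, y)
            grad = [[2.0 * r_i * e_j for e_j in e] for r_i in r]
            norm = sum(g * g for row in grad for g in row) ** 0.5
            if norm == 0.0:
                break  # nothing to ascend on
            W = [[w + lr_w * g / norm for w, g in zip(rw, rg)]
                 for rw, rg in zip(W, grad)]
            W = project(W, W0, radius)
        # Search step: gradient descent on the embedding so that it
        # restores the target under the freshly "erased" surrogate.
        for _step in range(search_steps):
            r = residual(W, e, y)
            grad = [2.0 * sum(W[i][j] * r[i] for i in range(len(W)))
                    for j in range(len(e))]
            e = [x - lr_e * g for x, g in zip(e, grad)]
        history.append(loss(W, e, y))
    return e, W, history


W0 = [[1.0, 0.0], [0.0, 1.0]]   # toy "original model"
y = [1.0, 0.5]                  # toy "target concept"
e_adv, W_final, history = adversarial_search(W0, y, e0=[0.0, 0.0])
```

In this toy setting the projection keeps the surrogate close to the original weights, so the search step can always recover the target. Whether the resulting embedding transfers to independently unlearned models is an empirical question that the paper's experiments address; the sketch makes no claim about it.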

Paper Structure

This paper contains 16 sections, 8 equations, 14 figures, 3 tables, and 1 algorithm.

Introduction
Related Work
Text-to-Image Diffusion Models
Diffusion Unlearning for Concept Erasure
Adversarial Concept Restoration
Method
Preliminaries
Transferable Adversarial Search Strategy
Experiments
Experimental Setup
Comparisons with Baseline Methods
Ablation Study
Conclusions
APPENDIX
Implementation Details
...and 1 more sections

Figures (14)

Figure 1: The Adversarial Search (AS) strategy for concept restoration. We adopt the original Stable Diffusion model as a surrogate model to alternately erase and search for embeddings which can restore the target concepts.
Figure 2: Visualization of the embeddings obtained from different unlearned models for the restoration of "Barack Obama". Purple, yellow, cyan, and green points represent embeddings obtained from models that have been unlearned by CA, FMN, ESD, and UCE, respectively. Blue and red points both represent embeddings acquired from the original Stable Diffusion (SD) model; the red ones are obtained with our Adversarial Search (AS) strategy.
Figure 3: The comparisons with different concept restoration methods for objects, encompassing both broad and narrow objects.
Figure 4: The comparisons with different concept restoration methods for artist styles and NSFW content.
Figure 5: The comparisons with different concept restoration methods for identities.
...and 9 more figures
