Rethinking the Vulnerability of Concept Erasure and a New Method

Alex D. Richardson, Kaicheng Zhang, Lucas Beerens, Dongdong Chen

TL;DR

This work investigates why concept erasure in text-to-image diffusion models remains vulnerable to restoration attacks. It shows that adversarial prompt embeddings are pervasive in the embedding space, largely inherited from the original model, not just from the erasure method. The authors introduce RECORD, a two-stage token-level restoration algorithm based on coordinate descent that avoids projection and consistently outperforms prior methods by up to 17.8x in attack success rate while offering flexible compute–performance tradeoffs. Their comprehensive experiments cover multiple erasure techniques, model scales, and transferability, highlighting both the practical risk to erased concepts and the need for more robust unlearning and defense strategies.

Abstract

The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. In response, concept erasure (defense) methods have been developed to "unlearn" specific concepts through post-hoc finetuning. However, recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts, revealing a critical vulnerability in current defense mechanisms. In this work, we first investigate the fundamental sources of adversarial vulnerability and reveal that vulnerabilities are pervasive in the prompt embedding space of concept-erased models, a characteristic inherited from the original pre-unlearned model. Furthermore, we introduce **RECORD**, a novel coordinate-descent-based restoration algorithm that consistently outperforms existing restoration methods by up to 17.8 times. We conduct extensive experiments to assess its compute-performance tradeoff and propose acceleration strategies.

Paper Structure

This paper contains 24 sections, 9 equations, 4 figures, 19 tables, 1 algorithm.

Figures (4)

  • Figure 1: a) Example images from models unlearned on the van Gogh painting style. b) The update schematic of RECORD, which uses a linear gradient approximation to obtain a small set of candidate tokens, and then updates the prompt based on exact evaluation of the loss function for those candidates.
  • Figure 2: Behavior of the text embeddings during embedding-level attacks on models unlearned with ESD and AdvUnlearn. a) Isomap projection of the optimization trajectories in the prompt embedding space $\mathbb{R}^{T\times77\times768}$ down to $\mathbb{R}^{T\times2}$. 2000 trajectories shown, each $T=10$ steps long. Dots / crosses denote the starting point. The erased concept can be generated at the end of each trajectory. b), c), d) present the cosine similarity histogram, computed in $\mathbb{R}^{77\times768}$, between the initial, optimized, and reference target embeddings.
  • Figure 3: The mean runtime and standard deviation (shown as error bars) of different restoration methods, computed over 10 runs at sequence lengths $S=8,16,32,64$. We note it is possible to achieve substantial acceleration by lowering the gradient token number $J$, with only marginal performance loss, as discussed in Appendix \ref{app:gradient_candidate_token}.
  • Figure 4: Behavior of the text embeddings during embedding-level attacks on the original model. a) Isomap projection of the optimization trajectories of the prompt embeddings. Dots denote the starting point of each trajectory. The target concept can be generated at the end of each trajectory. b), c), d) are smoothed cosine similarity histograms between the initial, optimized, and target embeddings.
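
The two-stage update described in the Figure 1 caption (a linear gradient approximation to shortlist candidate tokens, followed by exact loss evaluation) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical stand-in loss (squared distance between the mean prompt embedding and a target vector) in place of the diffusion model's loss, a small random embedding table `E`, and illustrative names such as `update_position`, `restore`, and `num_candidates` that do not come from the paper.

```python
import random

def mean_embedding(tokens, E):
    """Average the embeddings of the prompt tokens (toy prompt encoder)."""
    d = len(E[0])
    return [sum(E[t][j] for t in tokens) / len(tokens) for j in range(d)]

def loss(tokens, E, target):
    """Toy stand-in loss: squared distance to a target embedding."""
    m = mean_embedding(tokens, E)
    return sum((a - b) ** 2 for a, b in zip(m, target))

def update_position(tokens, pos, E, target, num_candidates=8):
    """One coordinate update: shortlist by gradient, then pick by exact loss."""
    S = len(tokens)
    m = mean_embedding(tokens, E)
    # Gradient of the toy loss w.r.t. the embedding at position `pos`.
    grad = [2.0 / S * (a - b) for a, b in zip(m, target)]
    # Stage 1: first-order estimate of the loss after swapping in token v.
    # The term shared by all v is dropped since only the ranking matters.
    scores = [sum(e * g for e, g in zip(E[v], grad)) for v in range(len(E))]
    candidates = sorted(range(len(E)), key=scores.__getitem__)[:num_candidates]
    # Stage 2: evaluate the exact loss for each candidate; keep improvements only.
    best, best_loss = tokens, loss(tokens, E, target)
    for v in candidates:
        trial = tokens[:pos] + [v] + tokens[pos + 1:]
        trial_loss = loss(trial, E, target)
        if trial_loss < best_loss:
            best, best_loss = trial, trial_loss
    return best, best_loss

def restore(E, target, seq_len=8, sweeps=3, num_candidates=8, seed=0):
    """Cycle over token positions, updating one coordinate at a time."""
    rng = random.Random(seed)
    tokens = [rng.randrange(len(E)) for _ in range(seq_len)]
    history = [loss(tokens, E, target)]
    for _ in range(sweeps):
        for pos in range(seq_len):
            tokens, current = update_position(tokens, pos, E, target, num_candidates)
            history.append(current)
    return tokens, history

# Toy data: 200 "tokens" with 16-dimensional embeddings.
rng = random.Random(1)
E = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
target = mean_embedding([3, 7, 11], E)
tokens, history = restore(E, target)
```

Because every update operates directly on discrete tokens, no step projects a continuous embedding back onto the vocabulary, and the recorded loss never increases, since a swap is accepted only when its exact loss is lower. In this toy version, shrinking `num_candidates` reduces the number of exact loss evaluations per update, which is the kind of compute and performance tradeoff the summary alludes to.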