Rethinking the Vulnerability of Concept Erasure and a New Method
Alex D. Richardson, Kaicheng Zhang, Lucas Beerens, Dongdong Chen
TL;DR
This work investigates why concept erasure in text-to-image diffusion models remains vulnerable to restoration attacks. It shows that adversarial prompt embeddings are pervasive in the embedding space and are largely inherited from the original model rather than arising solely from the erasure method. The authors introduce RECORD, a two-stage, token-level restoration algorithm based on coordinate descent that avoids projection and outperforms prior methods by up to 17.8x in attack success rate, while offering a flexible compute-performance tradeoff. Their experiments cover multiple erasure techniques, model scales, and transferability, and highlight both the practical risk to erased concepts and the need for more robust unlearning and defense strategies.
Abstract
The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. In response, concept erasure (defense) methods have been developed to "unlearn" specific concepts through post-hoc finetuning. However, recent concept restoration (attack) methods have demonstrated that these supposedly erased concepts can be recovered using adversarially crafted prompts, revealing a critical vulnerability in current defense mechanisms. In this work, we first investigate the fundamental sources of this adversarial vulnerability and show that adversarial prompts are pervasive in the prompt embedding space of concept-erased models, a characteristic inherited from the original model before unlearning. Furthermore, we introduce **RECORD**, a novel coordinate-descent-based restoration algorithm that consistently outperforms existing restoration methods, by up to 17.8 times in attack success rate. We conduct extensive experiments to assess its compute-performance tradeoff and propose acceleration strategies.
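To make the idea of token-level coordinate descent concrete, the following is a minimal toy sketch, not the RECORD algorithm itself. It updates one token position at a time and keeps a substitution only if it lowers a loss, so the search stays in discrete token space and needs no projection step. The function name `coordinate_descent_tokens`, its parameters, and the toy loss are all assumptions made for illustration; in a real attack the loss would presumably be computed from the concept-erased diffusion model.

```python
import random
from typing import Callable, List, Sequence


def coordinate_descent_tokens(
    init_tokens: Sequence[int],
    vocab: Sequence[int],
    loss_fn: Callable[[List[int]], float],
    n_sweeps: int = 3,
    n_candidates: int = 8,
    seed: int = 0,
) -> List[int]:
    """Greedy coordinate descent over a discrete token sequence (toy sketch)."""
    rng = random.Random(seed)
    vocab_list = list(vocab)
    tokens = list(init_tokens)
    best = loss_fn(tokens)
    for _ in range(n_sweeps):
        # One sweep visits every token position (coordinate) once.
        for pos in range(len(tokens)):
            # Sample a subset of candidate tokens for this position.
            k = min(n_candidates, len(vocab_list))
            for cand in rng.sample(vocab_list, k):
                trial = tokens.copy()
                trial[pos] = cand
                val = loss_fn(trial)
                # Accept only strict improvements, so the loss never increases.
                if val < best:
                    tokens, best = trial, val
    return tokens


# Hypothetical stand-in loss: number of positions that differ from a target prompt.
target = [3, 1, 4, 1, 5]


def toy_loss(tokens: List[int]) -> float:
    return float(sum(t != g for t, g in zip(tokens, target)))


result = coordinate_descent_tokens(
    [0] * 5, range(10), toy_loss, n_sweeps=5, n_candidates=10
)
```

In this sketch, `n_sweeps` and `n_candidates` control how many loss evaluations are spent, which is one simple way a compute-performance tradeoff can arise in such a search.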
