Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning

Renyang Liu, Guanlin Li, Tianwei Zhang, See-Kiong Ng

TL;DR

This work targets the safety of diffusion-based image generation models by examining how robust unlearning methods are to multi-modal adversarial inputs. It introduces Recall, a multi-modal guided attack that optimizes adversarial image prompts in latent space in tandem with an unchanged text prompt, exploiting text+image conditioning to recover erased concepts from unlearned IGMs. Across ten SOTA unlearning approaches and four tasks, Recall achieves high attack success, low computational cost, and strong semantic fidelity, exposing critical vulnerabilities in current unlearning pipelines. The results motivate robustness auditing and the development of certifiable defenses to ensure reliable and safe generative systems in real-world deployments.

Abstract

Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs. To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at https://github.com/ryliu68/RECALL.

Paper Structure

This paper contains 44 sections, 11 equations, 24 figures, 7 tables, 1 algorithm.

Figures (24)

  • Figure 1: Given an assumed successfully unlearned IGM $\mathcal{G}_u$, our adversarial image prompt $P^{adv}_{img}$ combined with the original sensitive text prompt $P_{text}$ as multi-modal guidance can circumvent the unlearning mechanism, leading to the reappearance of removed content $I^*$. Sensitive parts are covered by .
  • Figure 2: Overview of Recall. Given a reference image $P_{ref}$ that depicts the erased concept and a heavily noised initial image prompt $P_{img}^{init}$, we iteratively optimize the latent $z_{adv}$ (initialized from $P_{img}^{init}$) to align with the reference latent $z_{ref}$ under the same text condition. After optimization, $z_{adv}$ is decoded into an adversarial image $P_{img}^{adv}$, which is then paired with the original text prompt and fed into the unlearned model, enabling recovery of the erased concept, thereby exposing vulnerabilities of current unlearning mechanisms under multi-modal guidance.
  • Figure 3: Generated images under different attacks. Rows (top to bottom): Nudity, Van Gogh, Church, and Parachute.
  • Figure 4: Comparison of average attack time across attack methods on the Nudity task.
  • Figure 5: Comparison of average attack time (in seconds) for different attack methods across three unlearning tasks. The bar chart illustrates the attack efficiency of four attack approaches—P4D-N (blue), UnlearnDiffAtk (orange), WACE-C (red), and Recall (green)—against various unlearning techniques. A lower average attack time indicates higher efficiency.
  • ...and 19 more figures
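
The latent alignment step described in the Figure 2 caption can be illustrated with a deliberately simplified toy sketch. Everything in it is an assumption made for illustration: `toy_noise_predictor` is a hypothetical stand-in (a small tanh layer) and not a real diffusion denoiser, the squared-error objective is one plausible reading of "align with the reference latent under the same text condition", and the names, dimensions, step count, and learning rate do not come from the paper. It only shows the shape of the loop: hold the text condition fixed, start from a noise-like latent, and move it by gradient descent until the model responds to it as it does to the reference latent.

```python
import numpy as np


def toy_noise_predictor(z, c, W, U):
    """Hypothetical stand-in for a conditioned denoiser (NOT a real diffusion model)."""
    return np.tanh(W @ z + U @ c)


def align_latent(z_ref, z_init, c, W, U, steps=200, lr=0.1):
    """Gradient descent on ||f(z_adv, c) - f(z_ref, c)||^2 with the condition c fixed.

    Assumed objective for illustration only; returns the optimized latent and
    the loss recorded before each step plus once after the final step.
    """
    target = toy_noise_predictor(z_ref, c, W, U)  # response to the reference latent
    z_adv = z_init.copy()
    losses = []
    for _ in range(steps):
        pred = toy_noise_predictor(z_adv, c, W, U)
        diff = pred - target
        losses.append(float(diff @ diff))
        # Analytic gradient: d/dz sum((tanh(a) - t)^2) = 2 W^T [(1 - tanh(a)^2) * (tanh(a) - t)]
        grad = 2.0 * W.T @ ((1.0 - pred**2) * diff)
        z_adv = z_adv - lr * grad
    final = toy_noise_predictor(z_adv, c, W, U) - target
    losses.append(float(final @ final))
    return z_adv, losses


rng = np.random.default_rng(0)
d = 16  # toy latent size (assumption)
W = rng.standard_normal((d, d)) / (2 * np.sqrt(d))  # toy weights, scaled for a stable step size
U = rng.standard_normal((d, d)) / (2 * np.sqrt(d))
c = rng.standard_normal(d)       # fixed "text" condition
z_ref = rng.standard_normal(d)   # latent of the reference image
z_init = rng.standard_normal(d)  # noise-like starting latent

z_adv, losses = align_latent(z_ref, z_init, c, W, U)
print(f"alignment loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In the pipeline the caption describes, the optimized latent would then be decoded into an image prompt and paired with the original text prompt; the toy stops before that step because it has no decoder or generative model.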