Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models

Aakash Sen Sharma, Niladri Sarkar, Vikram Chundawat, Ankur A Mali, Murari Mandal

TL;DR

This work addresses the problem of incomplete and misleading unlearning in diffusion models by performing a white-box analysis of five methods and introducing two latent-space evaluation metrics, CCS and CRS, to detect true erasure versus concealment. It formalizes an evaluation framework that probes latent representations through partial diffusion and adversarial perturbations, and demonstrates that existing metrics relying on final outputs (e.g., KID, CLIP) often fail to reveal latent traces of forgotten concepts. Across multiple concept types (art style, identity, NSFW), CCS/CRS reveal persistent leakage and residual traces that standard metrics overlook, and a critical partial-diffusion threshold around $\psi \approx 0.55$ shows how forgotten concepts can be recovered by a retrained model. The findings highlight the need for latent-space, adversarially robust evaluation when assessing unlearning in diffusion models, with practical implications for deploying safe, trustworthy generative systems; code for the framework is released by the authors.

Abstract

Recent research has seen significant interest in methods for concept removal and targeted forgetting in text-to-image diffusion models. In this paper, we conduct a comprehensive white-box analysis showing the vulnerabilities in existing diffusion model unlearning methods. We show that existing unlearning methods lead to decoupling of the targeted concepts (meant to be forgotten) from the corresponding prompts. This is concealment, not the actual forgetting that was the original goal. This paper presents a rigorous theoretical and empirical examination of five commonly used techniques for unlearning in diffusion models and shows their potential weaknesses. We introduce two new evaluation metrics: Concept Retrieval Score (CRS) and Concept Confidence Score (CCS). These metrics are based on a successful adversarial attack setup that can recover forgotten concepts from unlearned diffusion models. CRS measures the similarity between the latent representations of the unlearned and fully trained models after unlearning. It reports the extent to which the forgotten concepts are retrieved as the amount of guidance increases. CCS quantifies the confidence of the model in assigning the target concept to the manipulated data. It reports the probability that the unlearned model's generations align with the original domain knowledge as the amount of guidance increases. Together, CCS and CRS enable a more robust evaluation of concept erasure methods. Evaluating five existing state-of-the-art methods with our metrics reveals significant shortcomings in their ability to truly unlearn. Source Code: https://respailab.github.io/unlearning-or-concealment

Paper Structure

This paper contains 22 sections, 8 theorems, 36 equations, 17 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Given a fully trained diffusion model $\theta^o$ and an unlearned model $\theta^u$, there exists a partial diffusion ratio $\psi \in (0, 1)$ such that the unlearned concept can be recovered with high probability.
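
The role of the partial diffusion ratio $\psi$ can be illustrated with a minimal sketch. It assumes that "partial diffusion" means applying the standard DDPM forward noising process to a latent from the fully trained model for a fraction $\psi$ of the schedule, after which the unlearned model would denoise the result. The function names, the linear beta schedule, and the toy latent are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def partial_diffuse(x0, psi, alpha_bar, rng):
    """Noise the latent x0 up to step t = round(psi * T) of the forward process.

    Uses the closed form x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    Returns the noised latent and the step index t.
    """
    if not 0.0 < psi < 1.0:
        raise ValueError("psi must lie in (0, 1)")
    total_steps = len(alpha_bar)
    t = max(1, round(psi * total_steps))
    a = alpha_bar[t - 1]
    noisy = [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for v in x0
    ]
    return noisy, t


# Toy latent standing in for the fully trained model's output (hypothetical).
alpha_bar = linear_alpha_bar(1000)
latent, step = partial_diffuse([1.0, -1.0, 0.5], 0.55, alpha_bar, random.Random(0))
```

In this sketch, `latent` would be handed to the unlearned model's denoiser for the remaining `step` reverse steps; that part depends on the specific model and is not shown. A smaller $\psi$ keeps more of the original signal in the latent, while a larger $\psi$ leaves the unlearned model to reconstruct more of the image on its own.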

Figures (17)

  • Figure 1: The proposed partial diffusion process to extract forgotten concepts from the unlearned model.
  • Figure 2: Probing existing unlearning methods with partial diffusion to generate the unlearned concepts. The first row shows the denoised output generated by the fully trained model. The second row is generated by the unlearned model using the image guidance of the fully trained model.
  • Figure 3: Softmax and cosine similarity values at different partial diffusion ratios for $\mathcal{CCS}$ (c) and $\mathcal{CRS}$ (d). Cosine similarity is computed between $\lambda_\mathcal{P}$ (partially diffused knowledge) and $\lambda_\mathcal{O}$ (original domain knowledge) for original knowledge, and between $\lambda_\mathcal{P}$ and $\lambda_\mathcal{U}$ (unlearned domain knowledge) for unlearned knowledge. Mean KID scores are shown in (e). The KID score is unable to differentiate between concealment and unlearning, whereas $\mathcal{CCS}$ and $\mathcal{CRS}$ indicate concealment rather than unlearning. Method: ESD-u [gandikota2024unified]. Prompt: "A nude woman with large breasts" (forget concept prompt).
  • Figure 4: In certain scenarios, SafeGen [li2024safegen] fails to guard against our partial-diffusion-based attacks.
  • Figure 5: Effect of partial diffusion using the original model and the retrained (gold) model. Prompt: "A desert-rose". Both the original and the retrained (gold) model are trained on a flower dataset, available at: https://huggingface.co/datasets/pranked03/flowers-blip-captions
  • ...and 12 more figures
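
The Figure 3 caption names two primitives: cosine similarity between latent representations and softmax confidence values. A minimal sketch of both is given here, using toy vectors in place of $\lambda_\mathcal{P}$, $\lambda_\mathcal{O}$, and $\lambda_\mathcal{U}$. The exact definitions of CRS and CCS are given in the paper; the vectors, logits, and function names here are hypothetical and only show how the two quantities behave.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for the latents named in the caption (hypothetical values).
lambda_p = [0.9, 0.1]  # partially diffused knowledge
lambda_o = [1.0, 0.0]  # original domain knowledge
lambda_u = [0.0, 1.0]  # unlearned domain knowledge

sim_to_original = cosine_similarity(lambda_p, lambda_o)
sim_to_unlearned = cosine_similarity(lambda_p, lambda_u)

# Toy classifier logits for (forgotten concept, other); hypothetical values.
confidence = softmax([2.0, 0.0])[0]
```

In this toy setup `sim_to_original` exceeds `sim_to_unlearned`, which is the pattern the caption describes as concealment: the partially diffused latent still sits closer to the original domain knowledge than to the unlearned one.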

Theorems & Definitions (13)

  • Proposition 1
  • Lemma 1.1
  • Proposition 2
  • Proposition 3
  • proof
  • Lemma 3.1
  • proof
  • Proposition 4
  • proof
  • Proposition 5
  • ...and 3 more