When Are Concepts Erased From Diffusion Models?

Kevin Lu, Nicky Kriplani, Rohit Gandikota, Minh Pham, David Bau, Chinmay Hegde, Niv Cohen

TL;DR

This work asks whether concept erasure in diffusion models truly removes knowledge or merely redirects generation away from the target concept. It proposes two conceptual models of erasure, guidance-based avoidance and destruction-based removal, together with an evaluation suite of optimization-based, in-context, training-free, steered-latent, and dynamic probes. The study finds that many erasure methods leave residual knowledge that several of these probes can recover, which suggests they act more like redirection than full unlearning, and it shows that erasure dynamics differ across methods. The findings argue for rigorous, multi-perspective evaluation and provide a framework for benchmarking and improving durable concept erasure in diffusion models.

Abstract

In concept erasure, a model is modified to selectively prevent it from generating a target concept. Despite the rapid development of new methods, it remains unclear how thoroughly these approaches remove the target concept from the model. We begin by proposing two conceptual models for the erasure mechanism in diffusion models: (i) interfering with the model's internal guidance processes, and (ii) reducing the unconditional likelihood of generating the target concept, potentially removing it entirely. To assess whether a concept has been truly erased from the model, we introduce a comprehensive suite of independent probing techniques: supplying visual context, modifying the diffusion trajectory, applying classifier guidance, and analyzing the model's alternative generations that emerge in place of the erased concept. Our results shed light on the value of exploring concept erasure robustness outside of adversarial text inputs, and emphasize the importance of comprehensive evaluations for erasure in diffusion models.

Paper Structure

This paper contains 41 sections, 10 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: We suggest that diffusion model concept erasure methods can be broadly categorized into two types: (1) Guidance-Based Avoidance, which avoids a concept by redirecting the model to different concept locations, and (2) Destruction-Based Removal, which reduces the unconditional likelihood of the target concept while keeping guidance intact, forcing the model toward another concept when prompted with the target concept. The height represents the unconditional likelihood $P(X)$.
  • Figure 2: Inpainting-based probe results for multiple erased concepts. For each method and concept, the masked region is filled by the model conditioned on the surrounding context. Task Vectors successfully reconstructs the erased region, despite its robustness to Textual Inversion and UnlearnDiffAtk.
  • Figure 3: Diffusion Completion outputs given intermediate images generated at timestep $t$ by the original (unerased) model. These noisy inputs are visualized in the first column via the Denoising Trajectory (DT) technique [gandikota2025distilling]. We then pass each of these unfinished images as contextual inputs to the erased models to complete the remaining denoising steps.
  • Figure 4: Our Noise-Based probing technique injects extra noise into the diffusion trajectory. At every denoising timestep, we add back a controlled amount of noise so that the model can search a larger region of the latent space.
  • Figure 5: An overview of erasing model behavior under adversarial probes and the Noise-Based probe. Our Noise-Based probe can recover the target concept ("church") even in cases where Textual Inversion and UnlearnDiffAtk fail.
  • ...and 4 more figures
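
The Noise-Based probe described in the Figure 4 caption can be pictured as a sampling loop that re-injects noise after each denoising step. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `denoise_step`, `extra_noise_scale`, the linearly decaying noise schedule, and the toy contraction "model" are all hypothetical names and choices made for this example.

```python
import random


def noise_based_probe(denoise_step, x_init, num_steps, extra_noise_scale, rng):
    """Run a denoising loop, adding Gaussian noise back after every step.

    denoise_step(x, t) is a hypothetical stand-in for one sampler step of an
    erased diffusion model. extra_noise_scale is an assumed knob for the
    "controlled amount" of noise; 0.0 recovers the plain denoising loop.
    """
    x = list(x_init)
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        if t > 1:  # leave the final output free of injected noise
            # Assumed schedule: the injected noise shrinks linearly with t.
            sigma = extra_noise_scale * t / num_steps
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def toy_denoise_step(x, t):
    # Toy "model" for illustration only: halve every coordinate.
    return [0.5 * xi for xi in x]


rng = random.Random(0)
x_init = [rng.gauss(0.0, 1.0) for _ in range(4)]

# Same starting latent, with and without the extra noise.
plain = noise_based_probe(toy_denoise_step, x_init, 10, 0.0, rng)
probed = noise_based_probe(toy_denoise_step, x_init, 10, 0.5, rng)
```

With `extra_noise_scale` set to 0 the loop reduces to ordinary denoising, so the two runs start from the same latent and differ only through the injected noise. In a real probe, `denoise_step` would wrap the erased model's sampler, and the noise scale would need tuning so that the trajectory explores more of the latent space without destroying image quality.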