Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation

Anh Bui, Long Vuong, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung

TL;DR

This work tackles the safety challenge of text-to-image diffusion models by erasing undesirable concepts while preserving others. It introduces Adversarial Concept Preservation, a bilevel framework that identifies and preserves the most sensitive concepts (c_a), those most affected by erasing the target concepts (c_e), using a CLIP-based analysis to quantify the impact. The method employs a discrete-to-continuous search via Gumbel-Softmax to optimize over the concept space and uses L1 erasure and L2 preservation losses, improving erasure quality while limiting degradation of unrelated concepts across object, NSFW, and artistic domains. The results on Stable Diffusion show gains over state-of-the-art erasure methods, with practical implications for safer and more controllable diffusion-based content generation.
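The discrete-to-continuous search mentioned above can be illustrated with a minimal NumPy sketch of the Gumbel-Softmax relaxation. This is not the authors' implementation: the function name `gumbel_softmax`, the candidate count, the embedding dimension, and the temperature `tau` are all assumptions chosen for illustration.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Draw a relaxed one-hot sample over candidate concepts."""
    # Gumbel(0, 1) noise via the inverse-CDF trick; low > 0 avoids log(0).
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    y = (logits + g) / tau
    y = y - y.max()  # subtract the max for numerical stability
    e = np.exp(y)
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical setup: 5 candidate concepts with 8-dimensional embeddings.
concept_embeddings = rng.normal(size=(5, 8))
logits = np.zeros(5)  # selection scores; these would be learned in a real search

weights = gumbel_softmax(logits, tau=0.5, rng=rng)
# A continuous mixture of the discrete candidates, usable in gradient-based search.
soft_concept = weights @ concept_embeddings
```

A lower `tau` pushes `weights` toward a one-hot choice of a single concept, while a higher `tau` gives a smoother mixture; in a gradient-based setting the scores in `logits` would be updated rather than fixed at zero.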

Abstract

Diffusion models excel at generating visually striking content from text but can inadvertently produce undesirable or harmful content when trained on unfiltered internet data. A practical solution is to selectively remove target concepts from the model, but this may impact the remaining concepts. Prior approaches have tried to balance this by introducing a loss term to preserve neutral content or a regularization term to minimize changes in the model parameters, yet resolving this trade-off remains challenging. In this work, we propose identifying and preserving the concepts most affected by parameter changes, termed adversarial concepts. This approach ensures stable erasure with minimal impact on the other concepts. We demonstrate the effectiveness of our method using the Stable Diffusion model, showing that it outperforms state-of-the-art erasure methods in eliminating unwanted content while maintaining the integrity of other unrelated elements. Our code is available at https://github.com/tuananhbui89/Erasing-Adversarial-Preservation.
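As a rough illustration of what "most affected by parameter changes" could mean, the toy sketch below picks, from a small candidate set, the concept whose output changes most between an original and an edited model. Everything here is hypothetical: the dictionaries stand in for model outputs, the squared-difference measure is an assumption, and `most_affected_concept` is not a function from the authors' code.

```python
import numpy as np

def most_affected_concept(original_fn, edited_fn, concepts):
    """Return the concept with the largest squared output change, plus all changes."""
    changes = {
        c: float(np.sum((edited_fn(c) - original_fn(c)) ** 2)) for c in concepts
    }
    return max(changes, key=changes.get), changes

# Toy stand-ins for the outputs of the original and the edited (sanitized) model.
original_outputs = {
    "road": np.array([1.0, 0.0]),
    "lexus": np.array([0.0, 1.0]),
    "sky": np.array([1.0, 1.0]),
}
edited_outputs = {
    "road": np.array([0.2, 0.0]),
    "lexus": np.array([0.0, 0.9]),
    "sky": np.array([1.0, 1.0]),
}

adversarial, changes = most_affected_concept(
    original_outputs.__getitem__, edited_outputs.__getitem__, list(original_outputs)
)
print(adversarial)  # prints "road", the concept with the largest change
```

In this toy setting the selected concept would be the one passed to a preservation term during fine-tuning, instead of a fixed neutral concept.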

Paper Structure

This paper contains 29 sections, 5 equations, 14 figures, 8 tables, 2 algorithms.

Figures (14)

  • Figure 1: Analysis of the impact of erasing the target concept on the model's capability. The impact is measured by the difference in CLIP score $\delta(c)$ between the original model and the corresponding sanitized model. (a): Impact of erasing "nudity" or "garbage truck" on other concepts. (b): Comparison of the impact of erasing the same concept, "garbage truck", on other concepts under different preservation strategies, including preserving a fixed concept such as " ", "lexus", or "road", and adaptively preserving the most sensitive concept found by our method.
  • Figure 2: Sensitivity spectrum of concepts to the target concept "nudity". The histogram shows the distribution of the similarity score between outputs of the original model $\theta$ and the corresponding sanitized model $\theta_{c_e}'$ for each concept $c$ from the CLIP tokenizer vocabulary.
  • Figure 3: Comparison of the impact of erasing the same concept, "nudity", on other concepts under different preservation strategies.
  • Figure 4: Images generated from the most sensitive concepts found by our method over the fine-tuning process. Top: Continuous search with PGD. Bottom: Discrete search with Gumbel-Softmax. $c_a$ denotes the keyword.
  • Figure 5: Comparison of the erasing performance on the I2P dataset. (a): Number of exposed body parts counted across all generated images with a detection threshold of 0.5. (b): Ratio of images with any exposed body parts detected by the NudeNet detector.
  • ...and 9 more figures