Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation

Anh Bui; Long Vuong; Khanh Doan; Trung Le; Paul Montague; Tamas Abraham; Dinh Phung

TL;DR

This work tackles the safety challenge of text-to-image diffusion models by erasing undesirable concepts while preserving others. It introduces Adversarial Concept Preservation, a bilevel framework that identifies and preserves the most sensitive concepts (c_a) affected by erasing target concepts (c_e), using a CLIP-based analysis to quantify that impact. The method employs a discrete-to-continuous search via Gumbel-Softmax to optimize over the concept space and uses L1 erasure and L2 preservation losses, achieving superior erasure quality with minimal degradation of unrelated concepts across object, NSFW, and artistic domains. Results on Stable Diffusion show robust performance gains over state-of-the-art methods, with practical implications for safer and more controllable diffusion-based content generation.
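The discrete-to-continuous search mentioned above can be illustrated with a minimal, dependency-free sketch of the Gumbel-Softmax relaxation. This is not the authors' implementation: the function names `gumbel_softmax` and `mix_embeddings`, the toy two-dimensional embeddings, and the temperature values are all assumptions made for illustration. The idea shown is only that a hard choice of one concept from a candidate list is replaced by soft weights, which can then blend the candidates' embeddings into a single continuous vector.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return relaxed one-hot weights over len(logits) candidates (hypothetical helper)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_embeddings(weights, embeddings):
    """Weighted sum of candidate embeddings: a continuous stand-in for picking one."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


if __name__ == "__main__":
    # Toy candidate concepts with made-up 2-D embeddings (assumption, not CLIP vectors).
    candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    weights = gumbel_softmax([0.1, 0.3, 2.0], temperature=0.5, rng=random.Random(0))
    print(weights, mix_embeddings(weights, candidates))
```

Lower temperatures push the weights toward a one-hot vector, so the blended embedding approaches a single candidate; higher temperatures give a smoother mixture.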

Abstract

Diffusion models excel at generating visually striking content from text but can inadvertently produce undesirable or harmful content when trained on unfiltered internet data. A practical solution is to selectively remove target concepts from the model, but this may impact the remaining concepts. Prior approaches have tried to balance this by introducing a loss term to preserve neutral content or a regularization term to minimize changes in the model parameters, yet resolving this trade-off remains challenging. In this work, we propose to identify and preserve the concepts most affected by parameter changes, termed \textit{adversarial concepts}. This approach ensures stable erasure with minimal impact on the other concepts. We demonstrate the effectiveness of our method using the Stable Diffusion model, showing that it outperforms state-of-the-art erasure methods in eliminating unwanted content while maintaining the integrity of other unrelated elements. Our code is available at https://github.com/tuananhbui89/Erasing-Adversarial-Preservation.
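As a toy illustration of what "most affected by parameter changes" could mean, the sketch below compares the outputs of an original and an edited model on a few candidate concepts and returns the one whose output shifts the most. Everything here is an assumption for illustration: the linear "models", the two-dimensional embeddings, the concept names, and the squared-distance measure are hypothetical stand-ins, not the paper's diffusion model, CLIP analysis, or loss definitions.

```python
def linear_model(weights):
    """Build a toy 'model' that maps an embedding to weights @ embedding."""
    def run(embedding):
        return [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    return run


def output_shift(original, edited, embedding):
    """Squared Euclidean distance between the two models' outputs for one concept."""
    before, after = original(embedding), edited(embedding)
    return sum((b - a) ** 2 for b, a in zip(before, after))


def most_affected_concept(original, edited, candidates):
    """Return the candidate name whose output changes most after the edit."""
    return max(candidates, key=lambda name: output_shift(original, edited, candidates[name]))


if __name__ == "__main__":
    original = linear_model([[1.0, 0.0], [0.0, 1.0]])
    # Hypothetical 'erasure' edit: damp the second axis of the toy model.
    edited = linear_model([[1.0, 0.0], [0.0, 0.2]])
    candidates = {"tree": [1.0, 0.0], "car": [0.9, 0.1], "wolf": [0.6, 0.8]}
    print(most_affected_concept(original, edited, candidates))
```

In this toy setup, the concept with the largest component along the damped axis is selected, which is the one a preservation term would then need to protect.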
