Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models

Chao Gong; Kai Chen; Zhipeng Wei; Jingjing Chen; Yu-Gang Jiang

TL;DR

Reliable and Efficient Concept Erasure (RECE) edits a text-to-image diffusion model in 3 seconds without any additional fine-tuning. Compared with previous approaches, it erases concepts more efficiently and thoroughly, causes only minor damage to the original generation ability, and is more robust against red-teaming tools.

Abstract

Text-to-image models encounter safety issues, including concerns related to copyright and Not-Safe-For-Work (NSFW) content. Although several methods have been proposed for erasing inappropriate concepts from diffusion models, they often exhibit incomplete erasure, consume substantial computing resources, and inadvertently damage generation ability. In this work, we introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without requiring additional fine-tuning. Specifically, RECE uses a closed-form solution to derive new target embeddings that are capable of regenerating erased concepts within the unlearned model. To mitigate the inappropriate content that these derived embeddings may represent, RECE further aligns them with harmless concepts in the cross-attention layers. The derivation and erasure of new embeddings are carried out iteratively to achieve thorough erasure of inappropriate concepts. In addition, to preserve the model's generation ability, RECE introduces a regularization term during the derivation process, which minimizes the impact on unrelated concepts during erasure. All of the steps above are in closed form, which makes erasure extremely efficient: it takes only 3 seconds. Benchmarked against previous approaches, our method achieves more efficient and thorough erasure with minor damage to the original generation ability and demonstrates enhanced robustness against red-teaming tools. Code is available at https://github.com/CharlesGong12/RECE.
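
To make the "closed-form" claim concrete, here is a worked sketch of how a derived embedding could be obtained. It assumes a ridge-regularized least-squares objective over the cross-attention projections; the exact objective, the index $i$ over projection matrices, and the weight $\lambda$ are assumptions for illustration, not taken from the text above.

```latex
% Assumed objective: make the edited model's output for c' match the
% original model's output for the erased concept c, with a penalty on ||c'||.
\begin{align}
c^\prime &= \arg\min_{c^\prime} \sum_i
  \bigl\| W_i^{\mathrm{new}} c^\prime - W_i^{\mathrm{old}} c \bigr\|_2^2
  + \lambda \bigl\| c^\prime \bigr\|_2^2 \\
% Setting the gradient with respect to c' to zero:
0 &= \sum_i W_i^{\mathrm{new}\top}
  \bigl( W_i^{\mathrm{new}} c^\prime - W_i^{\mathrm{old}} c \bigr)
  + \lambda c^\prime \\
% Solving the resulting linear system:
c^\prime &= \Bigl( \lambda I + \sum_i W_i^{\mathrm{new}\top} W_i^{\mathrm{new}} \Bigr)^{-1}
  \Bigl( \sum_i W_i^{\mathrm{new}\top} W_i^{\mathrm{old}} \Bigr) c
\end{align}
```

Under this assumed form, no gradient-based optimization is needed: each derivation is a single linear solve, which is consistent with the reported 3-second runtime. Larger $\lambda$ pulls $c^\prime$ toward $\mathbf{0}$.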

Paper Structure (32 sections, 1 theorem, 20 equations, 18 figures, 7 tables, 1 algorithm)

Introduction
Related Work
T2I Diffusion Models with Safety Mechanisms.
Red-Teaming Tools for T2I Diffusion Models.
Method
Preliminaries
Text-to-Image (T2I) Diffusion Models
Concept Erasing with Closed-form Solution
Reliable and Efficient Concept Erasure (RECE)
Finding Target Contents
Regularization Term
Experiments
Unsafe Content Removal
Experimental Setup
Removal Results
...and 17 more sections

Key Result

Theorem

If $c^\prime$ is set to $\mathbf{0}$, the regularization objective (eq:reg) achieves its global minimum of $0$.

Figures (18)

Figure 1: Overview of the proposed RECE. RECE consists of two main components: model editing and embedding derivation. First, concepts are erased by editing the model with a closed-form solution, which yields the edited cross-attention $W^{\mathrm{new}}$. Then, a new embedding $c_i^\prime$ is derived with the regularized closed-form equation for $c^\prime$ (eq:c prime with reg), given the original cross-attention $W^{\mathrm{old}}$ and the edited $W^{\mathrm{new}}$. In subsequent epochs, model editing and embedding derivation are repeated in a loop.
Figure 2: SD
Figure 3: nudity
Figure 4:
Figure 5: nudity
...and 13 more figures
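
The edit-then-derive loop described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single projection matrix, assumes both steps are regularized least-squares problems, and every function and parameter name (`edit_projection`, `derive_embedding`, `erase_iteratively`, `lam`) is hypothetical.

```python
import numpy as np

def edit_projection(W, erase, target, preserve, lam=0.1):
    """Assumed closed-form edit of one projection matrix W (shape: out x d).

    Minimizes  sum_c ||W' c - W target||^2 + sum_p ||W' p - W p||^2
               + lam * ||W' - W||_F^2
    so erased embeddings map to the target's output while preserved
    embeddings keep their original outputs.
    """
    d = W.shape[1]
    num = lam * W.copy()      # accumulates desired_output @ input^T
    den = lam * np.eye(d)     # accumulates input @ input^T
    for c in erase:
        num += np.outer(W @ target, c)
        den += np.outer(c, c)
    for p in preserve:
        num += np.outer(W @ p, p)
        den += np.outer(p, p)
    # Solve W' @ den = num; den is symmetric positive definite for lam > 0.
    return np.linalg.solve(den, num.T).T

def derive_embedding(W_old, W_new, c, lam=0.1):
    """Assumed ridge solution: the embedding c' whose output under W_new
    best matches the original output W_old @ c, with a penalty on ||c'||."""
    d = c.shape[0]
    A = W_new.T @ W_new + lam * np.eye(d)
    b = W_new.T @ (W_old @ c)
    return np.linalg.solve(A, b)

def erase_iteratively(W_old, concept, target, preserve, epochs=3, lam=0.1):
    """Alternate model editing and embedding derivation for a few epochs."""
    W_new = edit_projection(W_old, [concept], target, preserve, lam)
    derived = []
    for _ in range(epochs):
        # Find an embedding that still reproduces the erased concept's output.
        c_prime = derive_embedding(W_old, W_new, concept, lam)
        derived.append(c_prime)
        # Erase that embedding too, editing the current weights in place of W_old.
        W_new = edit_projection(W_new, [c_prime], target, preserve, lam)
    return W_new, derived
```

A quick way to try the sketch is to draw a random `W` of shape `(6, 4)`, random 4-dimensional vectors for `concept`, `target`, and a couple of `preserve` entries, and call `erase_iteratively(W, concept, target, preserve)`. Both steps are plain linear solves, so each epoch costs only a handful of small matrix operations.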
