Robust Concept Erasure Using Task Vectors

Minh Pham; Kelly O. Marshall; Chinmay Hegde; Niv Cohen

TL;DR

This work addresses the brittleness of prompt-dependent concept erasure in text-to-image models by proposing unconditional safety via Task Vectors (TV). TV edits, defined by $\boldsymbol{\tau} = \boldsymbol{\theta}_{ft} - \boldsymbol{\theta}_{pre}$ and applied as $\boldsymbol{\theta}_{pre} - \alpha \boldsymbol{\tau}$, remove target concepts without reliance on user prompts, and are enhanced by Diverse Inversion to estimate robust edit strength. By tuning $\alpha$ against a diverse set of adversarial prompts and selectively editing model weights, the method achieves stronger erasure while preserving core functionality. The results demonstrate improved resilience to adversarial inputs (including Ring-A-Bell prompts) and suggest that TV-based erasure, combined with weight pruning, can offer practical, robust concept erasure suitable for large diffusion models and potentially extendable to LLMs and other modalities.
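As a minimal sketch of the edit described above, the task vector is the per-parameter difference between fine-tuned and pretrained weights, and erasure subtracts a scaled copy of it from the pretrained weights. The `Weights` type, the function names, and the toy values are illustrative assumptions, not the authors' implementation; a real model would use tensors from a framework state dict rather than plain lists.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flattened values.
# (Hypothetical representation chosen to keep the sketch dependency-free.)
Weights = Dict[str, List[float]]


def task_vector(theta_pre: Weights, theta_ft: Weights) -> Weights:
    """Compute tau = theta_ft - theta_pre for every parameter."""
    return {
        name: [f - p for f, p in zip(theta_ft[name], theta_pre[name])]
        for name in theta_pre
    }


def erase(theta_pre: Weights, tau: Weights, alpha: float) -> Weights:
    """Apply the negated task vector: theta_pre - alpha * tau."""
    return {
        name: [p - alpha * t for p, t in zip(theta_pre[name], tau[name])]
        for name in theta_pre
    }


# Toy weights: theta_ft plays the role of a model fine-tuned on the target concept.
theta_pre = {"layer.weight": [1.0, 2.0, 3.0]}
theta_ft = {"layer.weight": [1.5, 2.0, 2.0]}

tau = task_vector(theta_pre, theta_ft)
theta_erased = erase(theta_pre, tau, alpha=2.0)
```

The edit never reads a prompt: it moves the weights away from the fine-tuned direction, and $\alpha$ alone controls how far.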

Abstract

With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user's prompt. We first show that compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs, not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model.
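One way to picture how a set of concept-inducing embeddings could inform the edit strength is sketched below: sweep candidate strengths and keep the smallest one for which even the worst-case embedding no longer yields the concept. This selection rule, the `concept_rate` callback, the threshold, and the toy scoring function are all assumptions made for illustration; the abstract does not specify the exact procedure.

```python
from typing import Callable, Optional, Sequence


def smallest_robust_alpha(
    alphas: Sequence[float],
    embeddings: Sequence[float],
    concept_rate: Callable[[float, float], float],
    threshold: float,
) -> Optional[float]:
    """Return the smallest alpha whose worst-case concept rate is <= threshold.

    `concept_rate(alpha, embedding)` is a hypothetical callback that estimates
    how often the edited model still generates the concept for one embedding.
    """
    for alpha in sorted(alphas):
        worst = max(concept_rate(alpha, e) for e in embeddings)
        if worst <= threshold:
            return alpha
    return None  # no candidate strength was sufficient


def toy_rate(alpha: float, resistance: float) -> float:
    """Toy model: each embedding is summarized by a 'resistance' to the edit."""
    return max(0.0, 1.0 - alpha / resistance)


alphas = [0.5, 1.0, 2.0, 4.0]
diverse_set = [1.0, 2.0, 4.0]  # several embeddings with different resistance
single_prompt = [1.0]          # only the easiest embedding

alpha_diverse = smallest_robust_alpha(alphas, diverse_set, toy_rate, threshold=0.0)
alpha_single = smallest_robust_alpha(alphas, single_prompt, toy_rate, threshold=0.0)
```

In this toy setting the single embedding suggests a weaker edit than the diverse set does, which mirrors the abstract's point that diversity makes the strength estimate more robust to unexpected prompts.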

Paper Structure (20 sections, 7 equations, 12 figures, 2 tables)

Introduction
Related Work
Conditional and Unconditional Concept Erasure
Motivating analysis
Marginal, conditional, and absolute safety
Task Vectors for unconditional safety
Diverse Inversion for Robust Concept Erasure Using Task Vectors
Diverse Inversion
Tuning the TV edit strength
Sub-selecting TV weights
Experiments
Experimental setup
Results
Discussion
Limitations
...and 5 more sections

Figures (12)

Figure 1: Concept erasure methods often filter out only a tiny volume in input space. Top row: Erased Stable Diffusion (ESD, with the "Van Gogh" concept erased); bottom row: SD 1.4. We plot generations using various adversarially optimized prompt embeddings, located at different cosine similarities from the embedding of the prompt "Van Gogh". Values in square brackets are cosine similarities in embedding space with the prompt "Van Gogh", ordered from left (input is far from the concept name) to right (closer to the concept name). ESD continues to produce "Van Gogh" concepts when the input prompt is far from the original concept name.
Figure 2: Input-independent vs. input-dependent concept erasure. Illustration of the probability of generating the target concept "Van Gogh" across the input space. Images featuring the "Van Gogh" concept are framed in red; other images are framed in green. Input-dependent concept erasure leaves areas with a high probability of generating the target concept, while input-independent erasure methods erase the target concept across the entire input space. (Top) In generative T2I models, the probability of generating a specific concept is high for prompt embeddings close to the concept name, but a high generation probability is also possible for prompt embeddings at a significant distance from it. (Middle) Input-dependent concept erasure attenuates the generation probability within a small neighborhood of the given prompt but leaves a high probability of generating the erased concept further away from the prompt embedding. (Bottom) Input-independent erasure attenuates the probability of generating the target concept more consistently across the input space.
Figure 3: TV-based concept erasure provides better unconditional safety. We plot the probability of unsafe generation with the most successful adversarial prompt from each given input complexity class (see the section on unconditional safety). While the input-dependent (finetune-based) concept erasure method focuses on protecting against undesired generations with a specific prompt, other prompts still produce unsafe generations with high probability. The input-independent (TV-based) erasure reduces the probability of unsafe generations compared to both the original and the input-dependent models, across the different complexity classes.
Figure 4: The trade-off between erasure score and control task performance. We plot robustness, measured by erasure score (lower is better), against control task performance (higher is better) for models erased with different TV edit strengths (parameterized by $\alpha$). Our Diverse Inversion method allows us to explore the trade-off between concept erasure robustness and model utility when editing different subsets of the model parameters. We find that different target concepts may benefit from editing different subsets of model parameters.
Figure 7: Robustness of TV-based concept erasure to Concept Inversion. A full TV edit is used to erase "Van Gogh" (left); a pruned TV edit is used to erase "French Horn" (right). We display three model variants (by row): the original model, and two models from which we removed the targeted concept using Task Vectors of different magnitudes. In both cases, TV-based erasure is robust against Concept Inversion (cce) and preserves the model utility on the control task. The third column demonstrates that TV preserves model performance on unrelated concepts.
...and 7 more figures
