Robust Concept Erasure Using Task Vectors
Minh Pham, Kelly O. Marshall, Chinmay Hegde, Niv Cohen
TL;DR
This work addresses the brittleness of prompt-dependent concept erasure in text-to-image models by proposing unconditional safety via Task Vectors (TV). TV edits, defined by $\boldsymbol{\tau} = \boldsymbol{\theta}_{ft} - \boldsymbol{\theta}_{pre}$ and applied as $\boldsymbol{\theta}_{pre} - \alpha \boldsymbol{\tau}$, remove target concepts without reliance on user prompts, and are enhanced by Diverse Inversion to estimate robust edit strength. By tuning $\alpha$ against a diverse set of adversarial prompts and selectively editing model weights, the method achieves stronger erasure while preserving core functionality. The results demonstrate improved resilience to adversarial inputs (including Ring-A-Bell prompts) and suggest that TV-based erasure, combined with weight pruning, can offer practical, robust concept erasure suitable for large diffusion models and potentially extendable to LLMs and other modalities.
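As a rough illustration of the task-vector edit summarized above (the task vector is the fine-tuned weights minus the pretrained weights, and the edited model is the pretrained weights minus a scaled task vector), the following is a minimal sketch on toy weights. The parameter names, the plain-list weight format, and the helpers `task_vector` and `apply_erasure` are hypothetical and chosen only for illustration; they are not taken from the authors' code. The optional `keep` argument is one assumed way to restrict the edit to a subset of parameters.

```python
from typing import Dict, List, Optional, Set

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def task_vector(theta_pre: Weights, theta_ft: Weights) -> Weights:
    """Compute tau = theta_ft - theta_pre for every parameter."""
    return {
        name: [ft - pre for ft, pre in zip(theta_ft[name], theta_pre[name])]
        for name in theta_pre
    }


def apply_erasure(
    theta_pre: Weights,
    tau: Weights,
    alpha: float,
    keep: Optional[Set[str]] = None,
) -> Weights:
    """Return theta_pre - alpha * tau.

    If `keep` is given (an assumption of this sketch), only the named
    parameters are edited; all others are copied unchanged.
    """
    edited: Weights = {}
    for name, values in theta_pre.items():
        if keep is not None and name not in keep:
            edited[name] = list(values)
        else:
            edited[name] = [w - alpha * t for w, t in zip(values, tau[name])]
    return edited


# Hypothetical toy weights; theta_ft plays the role of a model fine-tuned
# on the concept to be erased.
theta_pre = {"attn.weight": [1.0, 2.0], "mlp.weight": [0.5, -0.5]}
theta_ft = {"attn.weight": [1.5, 2.0], "mlp.weight": [0.5, 0.5]}

tau = task_vector(theta_pre, theta_ft)
edited_all = apply_erasure(theta_pre, tau, alpha=2.0)
edited_subset = apply_erasure(theta_pre, tau, alpha=2.0, keep={"attn.weight"})
```

With real diffusion models the same arithmetic would be applied tensor by tensor to the model's parameters; the list-based version here only shows the structure of the edit.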
Abstract
With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet these methods often protect only against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user's prompt. We first show that, compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model.
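To make the role of the embedding set concrete, here is a minimal sketch of how an edit strength might be chosen from a set of concept-inducing embeddings. The abstract does not specify the selection rule, so everything here is an assumption: the grid search, the threshold, the function `estimate_edit_strength`, and the toy `toy_concept_score` (which stands in for generating images from the edited model and scoring them with a concept detector). Each embedding is reduced to a single number describing how strongly it elicits the concept.

```python
from typing import Callable, Optional, Sequence


def estimate_edit_strength(
    alphas: Sequence[float],
    embeddings: Sequence[float],
    concept_score: Callable[[float, float], float],
    threshold: float,
) -> Optional[float]:
    """Return the smallest alpha for which every embedding scores below threshold.

    Assumed rule for illustration only: a smaller alpha is preferred because
    it perturbs the model less. Returns None if no candidate suffices.
    """
    for alpha in sorted(alphas):
        if all(concept_score(alpha, e) < threshold for e in embeddings):
            return alpha
    return None


def toy_concept_score(alpha: float, embedding_strength: float) -> float:
    """Hypothetical detector: the concept survives until alpha exceeds the
    strength with which the embedding elicits it."""
    return max(0.0, embedding_strength - alpha)


candidate_alphas = [0.5, 1.0, 1.5, 2.0, 2.5]

# A narrow probe set versus a more diverse one (toy values).
narrow_set = [0.5]
diverse_set = [0.5, 1.0, 1.75]

alpha_narrow = estimate_edit_strength(candidate_alphas, narrow_set, toy_concept_score, 0.1)
alpha_diverse = estimate_edit_strength(candidate_alphas, diverse_set, toy_concept_score, 0.1)
```

In this toy setting the narrow set yields a weaker edit than the diverse set, which illustrates, under these assumptions, why a more varied set of embeddings could give an estimate that holds up better against unexpected prompts.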
