CGCE: Classifier-Guided Concept Erasure in Generative Models

Viet Nguyen, Vishal M. Patel

TL;DR

CGCE introduces a lightweight, classifier-guided concept erasure framework that operates in the text-embedding space and refines prompts at inference time without modifying base model weights. A compact cross-attention classifier predicts whether a prompt is unsafe, and unsafe embeddings are updated iteratively through gradient-based refinement. This gives robust erasure against adversarial prompts while preserving safe content, and it scales to multi-concept erasure through gradient aggregation. CGCE demonstrates state-of-the-art safety performance across multiple T2I and T2V backbones with minimal utility loss, and its model-agnostic design makes it applicable to current and future generative architectures. Because only the lightweight classifier is trained and the base model needs no retraining, the approach works as a fast, plug-and-play safeguard that can be integrated into existing pipelines with negligible overhead.
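The detect-then-refine loop summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the classifier here is a hypothetical logistic regression over a flat embedding (the paper uses a cross-attention classifier), and the loss, threshold, step size, and function names are all assumptions chosen for clarity.

```python
import math


def unsafe_probability(embedding, weights, bias):
    """Toy stand-in for the safety classifier (assumption: logistic regression)."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def refine_embedding(embedding, weights, bias,
                     threshold=0.5, step_size=0.5, max_steps=100):
    """Return (embedding, steps_taken). Safe inputs pass through unchanged.

    Assumed loss: -log(1 - p), i.e. cross-entropy toward the "safe" label.
    For a logistic classifier its gradient w.r.t. the embedding is p * weights.
    """
    current = list(embedding)
    for step in range(max_steps):
        p = unsafe_probability(current, weights, bias)
        if p < threshold:
            # Classified as safe: stop and hand the embedding to the generator.
            return current, step
        grad = [p * w for w in weights]
        # Gradient descent on the embedding itself; the classifier is frozen.
        current = [x - step_size * g for x, g in zip(current, grad)]
    return current, max_steps


if __name__ == "__main__":
    weights, bias = [1.0, -2.0, 0.5], 0.0      # hypothetical trained classifier
    unsafe_prompt = [2.0, -1.0, 1.0]           # hypothetical prompt embedding
    refined, steps = refine_embedding(unsafe_prompt, weights, bias)
    print(steps, unsafe_probability(refined, weights, bias))
```

Only embeddings the classifier flags are modified, which is the property the summary credits for preserving quality on benign prompts.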

Abstract

Recent advancements in large-scale generative models have enabled the creation of high-quality images and videos, but have also raised significant safety concerns regarding the generation of unsafe content. To mitigate this, concept erasure methods have been developed to remove undesirable concepts from pre-trained models. However, existing methods remain vulnerable to adversarial attacks that can regenerate the erased content. Moreover, achieving robust erasure often degrades the model's generative quality for safe, unrelated concepts, creating a difficult trade-off between safety and performance. To address this challenge, we introduce Classifier-Guided Concept Erasure (CGCE), an efficient plug-and-play framework that provides robust concept erasure for diverse generative models without altering their original weights. CGCE uses a lightweight classifier operating on text embeddings to first detect and then refine prompts containing undesired concepts. This approach is highly scalable, allowing for multi-concept erasure by aggregating guidance from several classifiers. By modifying only unsafe embeddings at inference time, our method prevents harmful content generation while preserving the model's original quality on benign prompts. Extensive experiments show that CGCE achieves state-of-the-art robustness against a wide range of red-teaming attacks. Our approach also maintains high generative utility, demonstrating a superior balance between safety and performance. We showcase the versatility of CGCE through its successful application to various modern T2I and T2V models, establishing it as a practical and effective solution for safe generative AI.
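Multi-concept erasure by aggregating guidance from several classifiers might look like the sketch below. It is illustrative only: the abstract does not specify the aggregation rule, so summing the gradients of every classifier that flags the embedding is an assumption, as are the toy logistic classifiers, the concept names, and all hyperparameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def aggregate_refine(embedding, classifiers,
                     threshold=0.5, step_size=0.5, max_steps=200):
    """Refine an embedding until no per-concept classifier flags it.

    `classifiers` maps a concept name to (weights, bias) of a hypothetical
    logistic classifier. Gradients of the flagging classifiers are summed
    (assumed aggregation rule) and applied in a single update per step.
    """
    current = list(embedding)
    for _ in range(max_steps):
        total_grad = [0.0] * len(current)
        flagged = False
        for weights, bias in classifiers.values():
            logit = sum(w * x for w, x in zip(weights, current)) + bias
            p = sigmoid(logit)
            if p >= threshold:
                flagged = True
                # Gradient of -log(1 - p) w.r.t. the embedding is p * weights.
                for i, w in enumerate(weights):
                    total_grad[i] += p * w
        if not flagged:
            break  # every concept classifier now predicts "safe"
        current = [x - step_size * g for x, g in zip(current, total_grad)]
    return current


if __name__ == "__main__":
    classifiers = {
        "nudity": ([1.0, 0.0], 0.0),    # hypothetical concept classifiers
        "violence": ([0.0, 1.0], 0.0),
    }
    print(aggregate_refine([3.0, 2.0], classifiers))
```

Adding a concept in this sketch means adding one more entry to `classifiers`; the base generative model is never touched.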

Paper Structure

This paper contains 23 sections, 4 equations, 17 figures, 14 tables.

Figures (17)

  • Figure 1: We present CGCE, an efficient plug-and-play framework for robust and high-fidelity concept erasure. Top: CGCE produces safer and higher-quality results than state-of-the-art baselines (ESD, RECE, SAFREE, STEREO) across diverse T2I erasure tasks, including nudity, artistic style, and object removal. Bottom: The cross-modal safety and versatility of CGCE, which can be seamlessly applied as a safeguard to a range of modern T2I and T2V models to ensure safe generation without altering their original weights. Sensitive content (*) has been masked for publication.
  • Figure 2: Overview of CGCE. Stage 1: An LLM is used to create a dataset of paired prompts, each pair containing a safe prompt and a semantically similar unsafe version. Stage 2: A lightweight classifier is trained on the embeddings of these prompts to distinguish between safe and unsafe content. Stage 3: At inference time, the trained classifier acts as a plug-and-play safeguard. If an input prompt is safe, its embedding is passed directly to the generative model. If it is unsafe, the classifier acts as a refiner, using its own gradients to iteratively modify the embedding. This process steers the embedding away from the harmful concept before it is passed to the T2I or T2V model, ensuring a safe final output.
  • Figure 3: Qualitative evaluation of CGCE's effectiveness in erasing target concepts while preserving unrelated concepts, compared to baseline methods with the SD-v1.4 backbone. Sensitive content (*) has been masked for publication.
  • Figure 4: Qualitative evaluation of CGCE's effectiveness in erasing nudity concepts, compared to baseline methods with modern T2I and T2V architectures. Sensitive content (*) has been masked for publication.
  • Figure 5: Qualitative evaluation of CGCE's effectiveness in multi-concept erasure with the FLUX.1-dev model. Sensitive content (*) has been masked for publication.
  • ...and 12 more figures