Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness

Zixuan Fu; Yan Ren; Finn Carter; Chenyue Wen; Le Ku; Daheng Yu; Emily Davis; Bo Zhang

TL;DR

This work tackles the privacy, safety, and fairness risks of diffusion models by proposing SCORE, a principled framework for robust concept erasure. SCORE minimizes the mutual information between an erased concept $C$ and generated outputs $X$ through a formal adversarial independence game, augmented by trajectory consistency and saliency-guided updates to preserve fidelity. Theoretical contributions establish convergence, upper bounds on residual leakage, and generalization guarantees, connecting erasure to information-theoretic independence. Empirically, SCORE outperforms state-of-the-art baselines on Stable Diffusion v1.5 and FLUX across object removal, NSFW suppression, celebrity erasure, and artistic style unlearning, achieving up to 12.5\% higher erasure efficacy while maintaining image quality and demonstrating robustness to adaptive prompts. Overall, SCORE provides a scalable, provably secure approach to concept erasure with meaningful implications for privacy, safety, and fairness in generative AI.

Abstract

Diffusion models have achieved unprecedented success in image generation but pose increasing risks in terms of privacy, fairness, and security. A growing demand exists to \emph{erase} sensitive or harmful concepts (e.g., NSFW content, private individuals, artistic styles) from these models while preserving their overall generative capabilities. We introduce \textbf{SCORE} (Secure and Concept-Oriented Robust Erasure), a novel framework for robust concept removal in diffusion models. SCORE formulates concept erasure as an \emph{adversarial independence} problem, theoretically guaranteeing that the model's outputs become statistically independent of the erased concept. Unlike prior heuristic methods, SCORE minimizes the mutual information between a target concept and generated outputs, yielding provable erasure guarantees. We provide formal proofs establishing convergence properties and derive upper bounds on residual concept leakage. Empirically, we evaluate SCORE on Stable Diffusion and FLUX across four challenging benchmarks: object erasure, NSFW removal, celebrity face suppression, and artistic style unlearning. SCORE consistently outperforms state-of-the-art methods including EraseAnything, ANT, MACE, ESD, and UCE, achieving up to \textbf{12.5\%} higher erasure efficacy while maintaining comparable or superior image quality. By integrating adversarial optimization, trajectory consistency, and saliency-driven fine-tuning, SCORE sets a new standard for secure and robust concept erasure in diffusion models.
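To make the independence criterion concrete, here is a minimal sketch of the quantity the abstract says is minimized: the mutual information $I(C; X)$, which is zero exactly when the concept and the output are statistically independent. This is not SCORE's training procedure. It only evaluates the textbook definition on a small discrete table, and the two joint tables (`leaky`, `erased`) and the helper name `mutual_information` are hypothetical, chosen purely for illustration.

```python
import math


def mutual_information(joint):
    """Return I(C; X) in nats for a joint probability table joint[c][x]."""
    # Marginals: p(c) from row sums, p(x) from column sums.
    p_c = [sum(row) for row in joint]
    p_x = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for c, row in enumerate(joint):
        for x, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (p_c[c] * p_x[x]))
    return mi


# Hypothetical toy tables (assumptions, not data from the paper).
# Rows: concept prompted (no, yes). Columns: a detector fires on the output (no, yes).
leaky = [[0.45, 0.05],
         [0.05, 0.45]]   # output strongly tracks the concept
erased = [[0.40, 0.10],
          [0.40, 0.10]]  # output distribution is the same for both rows

if __name__ == "__main__":
    print("leaky  I(C;X) =", mutual_information(leaky))
    print("erased I(C;X) =", mutual_information(erased))
```

In the `erased` table every row has the same conditional distribution over outputs, so the joint factorizes into its marginals and the mutual information vanishes (up to floating-point error). The `leaky` table yields a strictly positive value, which is the kind of residual leakage an erasure method would aim to drive toward zero.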
