Safe and Reliable Diffusion Models via Subspace Projection
Huiqiang Chen, Tianqing Zhu, Linlin Wang, Xin Yu, Longxiang Gao, Wanlei Zhou
TL;DR
Diffusion-based text-to-image systems risk generating inappropriate content due to memorized concepts. The authors propose SAFER, a concept-subspace projection method. It learns a concept subspace in the text-embedding space using textual inversion from a reference image, identifies the dominant basis $U_1$ via SVD, and erases the concept by projecting prompt embeddings onto the complementary subspace with the projector $P = I - U_1U_1^T$. The subspace is progressively expanded to cover broader variations of the concept, and the concept can also be injected when desired. SAFER is training-free, integrates into cross-attention, and supports multi-concept erasure. It achieves thorough removal of artistic styles, objects, and nudity while preserving overall image quality, and it generalizes to synonyms of the erased concept. Ablations show that both textual inversion and subspace expansion are necessary for robust erasure, and experiments show strong performance relative to state-of-the-art baselines across style, object, and explicit-content categories. The approach provides an efficient, deployable safety mechanism for diffusion models, with practical relevance to content moderation and privacy-preserving model adaptation.
Abstract
Large-scale text-to-image (T2I) diffusion models have revolutionized image generation, enabling the synthesis of highly detailed visuals from textual descriptions. However, these models may inadvertently generate inappropriate content, such as copyrighted works or offensive images. While existing methods attempt to eliminate specific unwanted concepts, they often fail to ensure complete removal, allowing the concept to reappear in subtle forms. For instance, a model may successfully avoid generating images in Van Gogh's style when explicitly prompted with 'Van Gogh', yet still reproduce his signature artwork when given the prompt 'Starry Night'. In this paper, we propose SAFER, a novel and efficient approach for thoroughly removing target concepts from diffusion models. At a high level, SAFER is inspired by the observed low-dimensional structure of the text embedding space. The method first identifies a concept-specific subspace $S_c$ associated with the target concept $c$. It then projects the prompt embeddings onto the complementary subspace of $S_c$, effectively erasing the concept from the generated images. Since concepts can be abstract and difficult to fully capture using natural language alone, we employ textual inversion to learn an optimized embedding of the target concept from a reference image. This enables more precise subspace estimation and enhances removal performance. Furthermore, we introduce a subspace expansion strategy to ensure comprehensive and robust concept erasure. Extensive experiments demonstrate that SAFER consistently and effectively erases unwanted concepts from diffusion models while preserving generation quality.
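
The projection step described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `concept_basis` and `erase` are hypothetical, the subspace rank `k` is an assumed parameter, and random toy vectors stand in for the embeddings that SAFER would obtain through textual inversion. It only shows how an orthonormal basis $U_1$ taken from an SVD yields the projector $P = I - U_1U_1^T$, and that projected embeddings have no remaining component in the concept subspace.

```python
import numpy as np


def concept_basis(concept_embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Return the top-k left singular vectors of a (d, n) matrix.

    Columns of `concept_embeddings` are assumed to be embeddings of the
    target concept; the returned (d, k) matrix plays the role of U_1.
    """
    U, _, _ = np.linalg.svd(concept_embeddings, full_matrices=False)
    return U[:, :k]


def erase(prompt_embeddings: np.ndarray, U1: np.ndarray) -> np.ndarray:
    """Project (m, d) row-vector embeddings onto the complement of span(U1)."""
    d = U1.shape[0]
    P = np.eye(d) - U1 @ U1.T  # P = I - U_1 U_1^T (symmetric, idempotent)
    return prompt_embeddings @ P


# Toy data (assumption): concept embeddings are noisy multiples of one direction.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
C = np.outer(direction, rng.normal(size=8)) + 0.01 * rng.normal(size=(d, 8))

U1 = concept_basis(C, k=1)
prompts = rng.normal(size=(4, d))  # stand-ins for prompt token embeddings
cleaned = erase(prompts, U1)

# Largest remaining component along the concept basis (numerically zero).
residual = np.abs(cleaned @ U1).max()
```

Because $P$ is idempotent, applying `erase` a second time leaves the embeddings unchanged. How SAFER builds the embedding matrix, chooses the subspace dimension, and applies the projector inside cross-attention is not specified in this abstract, so those parts are left out of the sketch.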
