Safe and Reliable Diffusion Models via Subspace Projection

Huiqiang Chen, Tianqing Zhu, Linlin Wang, Xin Yu, Longxiang Gao, Wanlei Zhou

TL;DR

Diffusion-based text-to-image systems risk generating inappropriate content because of concepts they have memorized. The authors propose SAFER, a concept-subspace projection method. It learns a concept subspace in the text-embedding space using textual inversion from a reference image, identifies the dominant basis $U_1$ via SVD, and erases the concept by projecting prompts onto the complementary subspace with the projection $P=I-U_1U_1^T$. The subspace is progressively expanded to cover broader variations of the concept, and the method can also inject the concept when desired. SAFER is training-free, integrates into the cross-attention layers, and supports multi-concept erasure. It achieves thorough removal of artistic styles, objects, and nudity while preserving overall image quality and generalizing to synonyms. Ablations show that both textual inversion and subspace expansion are necessary for robust erasure, and experiments show strong performance relative to state-of-the-art baselines across style, object, and explicit-content categories. The approach provides an efficient, deployable safety mechanism for diffusion models, with practical impact on content moderation and privacy-preserving model adaptation.

Abstract

Large-scale text-to-image (T2I) diffusion models have revolutionized image generation, enabling the synthesis of highly detailed visuals from textual descriptions. However, these models may inadvertently generate inappropriate content, such as copyrighted works or offensive images. While existing methods attempt to eliminate specific unwanted concepts, they often fail to ensure complete removal, allowing the concept to reappear in subtle forms. For instance, a model may successfully avoid generating images in Van Gogh's style when explicitly prompted with 'Van Gogh', yet still reproduce his signature artwork when given the prompt 'Starry Night'. In this paper, we propose SAFER, a novel and efficient approach for thoroughly removing target concepts from diffusion models. At a high level, SAFER is inspired by the observed low-dimensional structure of the text embedding space. The method first identifies a concept-specific subspace $S_c$ associated with the target concept c. It then projects the prompt embeddings onto the complementary subspace of $S_c$, effectively erasing the concept from the generated images. Since concepts can be abstract and difficult to fully capture using natural language alone, we employ textual inversion to learn an optimized embedding of the target concept from a reference image. This enables more precise subspace estimation and enhances removal performance. Furthermore, we introduce a subspace expansion strategy to ensure comprehensive and robust concept erasure. Extensive experiments demonstrate that SAFER consistently and effectively erases unwanted concepts from diffusion models while preserving generation quality.
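The projection step described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the embeddings are synthetic stand-ins for text-encoder outputs, the function names `concept_basis` and `erase_projector` are hypothetical, and the embeddings are stacked as columns (without centering) so that the left singular vectors play the role of $U_1$ in $P=I-U_1U_1^T$.

```python
import numpy as np

def concept_basis(E: np.ndarray, k: int = 1) -> np.ndarray:
    """Top-k left singular vectors of E, whose columns are prompt embeddings (d x n)."""
    U, _, _ = np.linalg.svd(E, full_matrices=False)
    return U[:, :k]

def erase_projector(U_k: np.ndarray) -> np.ndarray:
    """Projector onto the orthogonal complement of span(U_k): P = I - U_k U_k^T."""
    d = U_k.shape[0]
    return np.eye(d) - U_k @ U_k.T

# Synthetic stand-in data (assumption): n concept prompts that share one
# dominant direction in a d-dimensional embedding space, plus small noise.
rng = np.random.default_rng(0)
d, n = 8, 5
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
E = np.outer(direction, rng.uniform(1.0, 2.0, size=n)) + 0.01 * rng.normal(size=(d, n))

U1 = concept_basis(E, k=1)
P = erase_projector(U1)

# A prompt embedding that carries the concept direction, then its erased version.
prompt = rng.normal(size=d) + 3.0 * direction
erased = P @ prompt

# The erased embedding has no component left along the estimated concept basis
# (up to floating-point error).
print(np.abs(U1.T @ erased).max())
```

Because $P$ is an orthogonal projector, applying it twice changes nothing further, and everything orthogonal to the concept basis passes through untouched, which is consistent with the claim that generation quality for unrelated prompts is preserved.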

Paper Structure

This paper contains 21 sections, 16 equations, 10 figures, 3 tables, 3 algorithms.

Figures (10)

  • Figure 1: Results on removing Van Gogh's style. Existing erasing methods remain incomplete: when "Van Gogh" is replaced with "Vincent Gogh", or the prompt is "Starry Night", the generated images still retain Van Gogh's style. In contrast, our approach effectively removes the target style by projecting the prompt's text embedding onto the complementary subspace of the concept embedding.
  • Figure 2: Overview of the Proposed Framework. (a) Starting with a reference image containing the target concept, we invert its visual characteristics into a specialized token $\mathcal{T}_c$ paired with an optimized embedding within the text encoder's vocabulary. (b) The concept subspace is estimated in the text embedding space by generating prompts that combine prompt templates with the specialized token derived from the reference image. The first principal component of these prompt embeddings forms the basis of the target concept subspace. (c) The concept subspace is progressively expanded to approximate the target concept more comprehensively, ensuring a thorough erasure. (d) After identifying the target concept subspace, the text embedding is projected onto its complementary subspace. The resulting projection matrix is integrated into the weights of the cross-attention layers.
  • Figure 3: SVD results of text embeddings. The explained variance of the first component is significantly higher than that of the second, indicating that the prompt embeddings are strongly correlated along one dominant direction. Shown for (a) the style of Monet and (b) the object Airplane.
  • Figure 4: Paintings of Van Gogh in different styles. It is hard to describe them all with one image or a single unified style, so we progressively expand the subspace to include the different styles of Van Gogh.
  • Figure 5: Style Information Captured via a Subspace. We use different projection matrices of the form $I-U_jU_j^T$ to remove Van Gogh's style. The generated images demonstrate that the style information can be captured by the subspace spanned by the first principal component of the SVD results. The prompts used for image generation are displayed above each image.
  • ...and 5 more figures
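
Step (d) in the Figure 2 caption says the projection matrix is integrated into the cross-attention weights. A minimal NumPy sketch of why that works is given below. It assumes the weights in question are linear maps that consume text embeddings (such as a key projection); the names `W_k` and `W_k_safe`, the dimensions, and the random data are hypothetical placeholders rather than anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_text, d_attn, n_tokens = 8, 4, 3

# Hypothetical unit concept direction and its erasure projector P = I - u u^T.
u = rng.normal(size=(d_text, 1))
u /= np.linalg.norm(u)
P = np.eye(d_text) - u @ u.T

# Stand-in for a cross-attention projection that maps text embeddings to keys.
W_k = rng.normal(size=(d_attn, d_text))

# Fold the projector into the weights once, offline.
W_k_safe = W_k @ P

# Token embeddings as rows, following the y = x W^T convention of linear layers.
tokens = rng.normal(size=(n_tokens, d_text))

# Option 1: project every embedding at inference time, then apply W_k.
keys_runtime = (tokens @ P.T) @ W_k.T
# Option 2: apply the folded weights directly to the unmodified embeddings.
keys_folded = tokens @ W_k_safe.T

print(np.allclose(keys_runtime, keys_folded))
```

The two options agree because matrix multiplication is associative, so folding $P$ into the weights removes the concept without an extra projection step at inference time. The folded weights also map the concept direction itself to zero.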