OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure

Chuancheng Shi, Wenhua Wu, Fei Shen, Xiaogang Zhu, Kun Hu, Zhiyong Wang

Abstract

Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when they suppress selected neurons entirely. This damage occurs because sensitive and benign semantics exhibit non-orthogonal superposition: they share activation subspaces in which their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAEs) to achieve high-resolution feature disentanglement and redefines erasure as an analytical orthogonal projection that preserves the benign manifold's invariance. OrthoEraser first employs an SAE to decompose dense activations and isolate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features that are vulnerable to the intervention. The key novelty is an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This projection decouples the sensitive concepts from the identified critical benign subspace, preserving non-sensitive semantics. Safety experiments demonstrate that OrthoEraser achieves high erasure precision, removing harmful content while preserving the integrity of the generative manifold, and significantly outperforms SOTA baselines. This paper contains results from unsafe models.
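
The null-space projection named in the abstract can be illustrated with a few lines of linear algebra. The following is a minimal sketch, not the authors' implementation: it assumes the coupled benign neurons are represented by a matrix of direction vectors (the hypothetical `coupled_dirs`) and that the erasure update is a single vector (the hypothetical `erasure_vec`). The function name and both parameter names are illustrative and do not come from the paper.

```python
import numpy as np


def project_onto_null_space(erasure_vec, coupled_dirs):
    """Strip from erasure_vec every component in the span of coupled_dirs.

    erasure_vec  : (d,) array, hypothetical erasure / intervention direction.
    coupled_dirs : (k, d) array, hypothetical directions of the coupled
                   (benign) neurons, one per row.
    Returns a (d,) array orthogonal to every row of coupled_dirs.
    """
    v = np.asarray(erasure_vec, dtype=float)
    C = np.atleast_2d(np.asarray(coupled_dirs, dtype=float))
    # C.T @ pinv(C.T) is the orthogonal projector onto the row space of C.
    # The pseudoinverse also handles linearly dependent rows.
    coeffs = np.linalg.pinv(C.T) @ v
    # Subtracting the row-space component leaves the null-space component.
    return v - C.T @ coeffs


# Toy example: protect the first coordinate axis.
protected = np.array([[1.0, 0.0, 0.0]])
raw_update = np.array([1.0, 2.0, 3.0])
safe_update = project_onto_null_space(raw_update, protected)
```

In this toy case the component of `raw_update` along the protected axis is removed and the other two components are kept, so an intervention applied along `safe_update` has zero first-order effect on the protected direction. How the paper actually constructs the coupled subspace and the erasure vectors is not specified in the abstract.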

Paper Structure (20 sections, 25 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Comparison of concept erasure strategies between (a) existing methods and (b) our method, OrthoEraser. (a) Existing methods typically treat sensitive concepts as spatially isolated. (b) OrthoEraser decouples entangled features via gradient orthogonalization, selectively removing sensitive components to preserve non-sensitive generation capabilities.
  • Figure 2: Overall framework of our proposed OrthoEraser, which consists of three key components: (i) sensitive neuron detection with sparse autoencoders (SAE) (Cunningham et al., 2023), (ii) coupled neuron detection via zero-ablation, and (iii) sensitive information suppression with gradient orthogonalization, which projects intervention vectors onto the null space of these coupled neurons, ensuring precise concept erasure.
  • Figure 3: Layer-wise Localization and Sensitive Neuron Identification. (Left) Sensitive score (SS) distribution across layers, used to identify target sensitive layers. (Right) $\Delta\mathrm{WFS}$ values of the top-50 neurons in sensitive layers, representing the core feature units contributing to the sensitive concept in the SAE space.
  • Figure 4: Qualitative comparison with SOTA methods. Qualitative comparison of various safety guidance methods on the I2P dataset.
  • Figure 5: Neuron-level ablation. Comparison between random neuron suppression and our targeted selection on the I2P dataset.
  • ...and 6 more figures