OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure

Chuancheng Shi; Wenhua Wu; Fei Shen; Xiaogang Zhu; Kun Hu; Zhiyong Wang

Abstract

Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes because they suppress the selected neurons entirely. This damage occurs because sensitive and benign semantics exhibit non-orthogonal superposition: they share activation subspaces in which their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAEs) to achieve high-resolution feature disentanglement and redefines erasure as an analytical orthogonal projection that preserves the invariance of the benign manifold. OrthoEraser first employs an SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features that are vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This projection orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.
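The projection step described in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's implementation: it assumes that "projecting onto the null space of the coupled neurons" means removing from an erasure vector every component that lies in the span of the coupled-neuron directions, and it uses plain Gram-Schmidt in place of the closed-form derivation the paper refers to. The names `dot`, `orthonormal_basis`, and `project_out` are hypothetical, as are the toy vectors.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of `vectors`.

    Directions that are (numerically) linear combinations of earlier
    ones are skipped, so redundant coupled directions are harmless.
    """
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(erasure: Vector, coupled: List[Vector]) -> Vector:
    """Project `erasure` onto the orthogonal complement of span(coupled).

    Assumption (not taken from the paper): each coupled neuron is
    represented by one direction vector in the same space as `erasure`.
    """
    result = list(erasure)
    for b in orthonormal_basis(coupled):
        coeff = dot(result, b)
        result = [ri - coeff * bi for ri, bi in zip(result, b)]
    return result


if __name__ == "__main__":
    # Hypothetical toy directions for two coupled (benign) neurons.
    coupled_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
    # Hypothetical raw erasure direction for the sensitive concept.
    raw_erasure = [1.0, 2.0, 3.0]
    safe_erasure = project_out(raw_erasure, coupled_dirs)
    print(safe_erasure)
    # The projected vector has zero inner product with every coupled direction.
    print([dot(safe_erasure, c) for c in coupled_dirs])
```

Under these assumptions, applying the projected vector leaves the activations along the coupled directions unchanged to first order, which is the intuition behind preserving benign semantics while still suppressing the sensitive component.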

Paper Structure (20 sections, 25 equations, 11 figures, 5 tables)

Introduction
Related Work
Proposed Method
Sensitive Neuron Detection
Coupled Neuron Detection
Sensitive Information Suppression
Experiments and Discussions
Quantitative Comparison with SOTA Methods
Qualitative Comparison with SOTA Methods
Ablation Study
More Results and Analysis
Conclusion
Theoretical Analysis
Problem Formulation
Closed-Form Derivation via Method of Lagrange Multipliers
...and 5 more sections

Figures (11)

Figure 1: Comparison of concept erasure strategies between (a) existing methods and (b) our method, OrthoEraser. (a) Existing methods typically treat sensitive concepts as spatially isolated. (b) OrthoEraser decouples entangled features via gradient orthogonalization, selectively removing sensitive components to preserve non-sensitive generation capabilities.
Figure 2: Overall framework of the proposed OrthoEraser, which consists of three key components: (i) sensitive neuron detection with sparse autoencoders (SAE; Cunningham et al., 2023), (ii) coupled neuron detection via zero-ablation, and (iii) sensitive information suppression with gradient orthogonalization, which projects intervention vectors onto the null space of the coupled neurons to ensure precise concept erasure.
Figure 3: Layer-wise Localization and Sensitive Neuron Identification. (Left) sensitive score (SS) distribution across layers used to identify target sensitive layers. (Right) $\Delta WFS$ values of the top-50 neurons in sensitive layers, representing the core feature units contributing to the sensitive concept in the SAE space.
Figure 4: Qualitative comparison with SOTA methods. Outputs of various safety guidance methods on the I2P dataset.
Figure 5: Neuron-level ablation. Comparison between random neuron suppression and our targeted selection on the I2P dataset.
...and 6 more figures
