Identifying Intervenable and Interpretable Features via Orthogonality Regularization

Moritz Miller; Florent Draye; Bernhard Schölkopf

TL;DR

The paper tackles identifiability and interpretability of features learned by a sparse autoencoder embedded in a language-model fine-tuning pipeline. It introduces an almost-orthogonal decoder via an orthogonality penalty to disentangle features while preserving task performance. The authors connect identifiability theory and finite frame theory to show why orthogonality improves intervenability and reduces feature superposition, and they validate this with experiments showing maintained math-reasoning performance, increased feature distinctness, and successful local interventions. The approach yields more diverse explanations and enables swapping concepts with controlled effects, suggesting practical benefits for modular, causally interpretable representations.

Abstract

With recent progress on fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the $\textit{Independent Causal Mechanisms}$ principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under $\texttt{https://github.com/mrtzmllr/sae-icm}$.

Paper Structure (27 sections, 2 theorems, 20 equations, 8 figures)

Introduction
Background
Notation
Identifiable Dictionary Learning
Sparse Autoencoders
Finite Frame Theory
Aligned Features Reduce Intervenability
Experiments
Almost Orthogonality While Keeping Performance
The Penalty Does Not Impact Interpretability
Almost Orthogonality Incentivizes Distinct Features
Localized Interventions Work
Related Work
Identifiability
Interpretability
...and 12 more sections

Key Result

Theorem 2.1

Let $\mathbf{D} \in \mathbb{R}^{m \times d}$ have unit-norm columns and denote by $\mu$ the self-coherence as defined in the paper's self-coherence equation. If every $\mathrm{K}$-sparse representation $\tilde{\mathbf{x}} = \mathbf{D} \mathbf{z}$ is unique. That is, for $\mathbf{D} \mathbf{z} = \mathbf{D} \tilde{\mathbf{z}}$ with $\|\mathbf{z}\|_0, \|\til
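
The theorem is stated in terms of the self-coherence $\mu$, whose defining equation is not reproduced on this page. The sketch below assumes the standard dictionary-learning definition, the largest absolute inner product between two distinct unit-norm columns of $\mathbf{D}$; the paper's own definition may differ in detail. The helper names and the toy dictionary are hypothetical and chosen only for illustration.

```python
import math


def normalize_columns(D):
    """Return a copy of D (a list of rows) with every column scaled to unit norm.

    Assumes no column is all zeros.
    """
    m, d = len(D), len(D[0])
    norms = [math.sqrt(sum(D[i][j] ** 2 for i in range(m))) for j in range(d)]
    return [[D[i][j] / norms[j] for j in range(d)] for i in range(m)]


def self_coherence(D):
    """Largest |<d_j, d_k>| over distinct columns j != k (assumed definition)."""
    m, d = len(D), len(D[0])
    mu = 0.0
    for j in range(d):
        for k in range(j):  # each unordered pair of distinct columns once
            inner = sum(D[i][j] * D[i][k] for i in range(m))
            mu = max(mu, abs(inner))
    return mu


# Hypothetical toy dictionary: three atoms in R^2, the third along the diagonal.
D = normalize_columns([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])
mu = self_coherence(D)

# An orthonormal dictionary has no overlap between atoms.
mu_orthonormal = self_coherence([[1.0, 0.0],
                                 [0.0, 1.0]])
```

Under this definition, a smaller $\mu$ means the dictionary atoms overlap less, which is the quantity an orthogonality penalty on the decoder pushes down.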

Figures (8)

Figure 1: Intervention on SAE Feature. We query an LM fine-tuned with the orthogonality penalty $10^{-4}$. At inference, we exchange the feature associated with $\textbf{Jerry}$ with the feature corresponding to the prefix $\textbf{aqua}$. The model then replaces $\textbf{Jerry}$ with $\textbf{Aquaman}$ while maintaining its reasoning capabilities.
Figure 2: Orthogonality Evaluation Loss. We plot the orthogonality loss $\|\mathrm{tril}(\mathbf{D}^\top \mathbf{D})\|_\text{F}^2$ for all values of $\lambda$. For each $\lambda$, we evaluate on a subset of $1{,}024$ active features in the decoder. Error bars represent confidence intervals obtained from running $100$ evaluations.
Figure 3: Evaluation on GSM8K. We evaluate on the $\texttt{GSM8K}$ test set. Error bars represent basic bootstrap confidence intervals (Efron, 1979) on $100$ randomly drawn datasets.
Figure 4: Interpretability Score. We plot the interpretability score, defined as correctly identifying which one of five examples relates most closely to the provided explanation. Error bars are basic bootstrap confidence intervals (Efron, 1979) with $100$ resamples.
Figure 5: Embedding Explanations. We plot the cosine similarity between embedded feature explanations against the orthogonality penalty $\lambda$. The error bars are basic bootstrap intervals (Efron, 1979) on $100$ sampled subsets.
...and 3 more figures
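
The orthogonality loss $\|\mathrm{tril}(\mathbf{D}^\top \mathbf{D})\|_\text{F}^2$ from the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that $\mathrm{tril}$ keeps only the strictly lower triangle of the Gram matrix, so that a decoder with orthonormal columns has zero loss. The caption does not say whether the diagonal is included, so that choice is exposed as the hypothetical flag `include_diagonal`. The function names and toy matrices are likewise illustrative.

```python
def gram(D):
    """Gram matrix D^T D for D given as a list of rows (m x d)."""
    m, d = len(D), len(D[0])
    return [[sum(D[i][j] * D[i][k] for i in range(m)) for k in range(d)]
            for j in range(d)]


def orthogonality_loss(D, include_diagonal=False):
    """Squared Frobenius norm of the lower triangle of D^T D.

    Assumption: by default the diagonal is excluded, so only overlaps
    between distinct columns are penalized.
    """
    G = gram(D)
    total = 0.0
    for j in range(len(G)):
        upper = j + 1 if include_diagonal else j
        for k in range(upper):
            total += G[j][k] ** 2
    return total


# Hypothetical toy decoders with unit-norm columns.
orthonormal = [[1.0, 0.0],
               [0.0, 1.0]]
skewed = [[1.0, 0.6],
          [0.0, 0.8]]  # columns (1, 0) and (0.6, 0.8) overlap by 0.6

loss_orthonormal = orthogonality_loss(orthonormal)
loss_skewed = orthogonality_loss(skewed)
```

In a training setup, a term of this form would be scaled by $\lambda$ and added to the task loss; how the paper combines the two is not shown on this page.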

Theorems & Definitions (5)

Theorem 2.1: Self-coherence bound for uniqueness
Theorem 3.1: Post-Intervention Interference Between Features
proof
Definition 3.1: Effect of Interference on Feature
proof
