On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability

Yiming Tang, Harshvardhan Saini, Yizhen Liao, Dianbo Liu

TL;DR

This work presents the first unified theoretical framework for Sparse Dictionary Learning (SDL) in mechanistic interpretability, treating SAEs, transcoders, and crosscoders as instances of a single optimization problem under the Linear Representation Hypothesis and the Superposition Hypothesis. It proves the existence of approximate global minima with reconstruction error scaling as $O(\epsilon^2)$ and derives necessary and sufficient conditions for global optimality under extreme sparsity, effectively decomposing the problem into per-feature constraints. The analysis further links spurious local minima to feature absorption, offering a rigorous explanation for observed phenomena such as dead neurons and neuron resampling, and validates key predictions with controlled experiments. Collectively, the paper provides a principled foundation for understanding when SDL methods recover ground-truth interpretable features and how their optimization landscapes relate to empirical interpretability phenomena.

Abstract

As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they process information has become increasingly important for both scientific progress and trustworthy deployment. Recent work in mechanistic interpretability has shown that neural networks represent meaningful concepts as directions in their representation spaces and often encode many concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into interpretable features. These methods have demonstrated remarkable empirical success, but their theoretical understanding remains limited. Existing theoretical work is limited to sparse autoencoders with tied-weight constraints, leaving the broader family of SDL methods without formal grounding. In this work, we develop the first unified theoretical framework that treats SDL as a single optimization problem. We demonstrate how diverse methods instantiate the theoretical framework and provide a rigorous analysis of the optimization landscape. We provide the first theoretical explanations for some empirically observed phenomena, including feature absorption, dead neurons, and the neuron resampling technique. We further design controlled experiments to validate our theoretical results.

Paper Structure

This paper contains 21 sections, 3 theorems, 42 equations, 3 figures.

Key Result

Theorem 4.1

Consider the SDL loss function where $\mathbf{x}_p(s)$ and $\mathbf{x}_r(s)$ satisfy the Linear Representation Hypothesis and the Superposition Hypothesis with interference parameter $\epsilon$. When $n_q \geq n$ and $\sigma = \text{ReLU}$, the configuration achieves

Figures (3)

  • Figure 1: Sparse Autoencoder: encoder $W_e$ maps $\mathbf{x}_p$ to sparse latents $\mathbf{x}_q$, decoder $W_d$ reconstructs from $\mathbf{x}_q$.
  • Figure 2: Transcoder: encoder $W_e$ maps layer $\ell$ to sparse latents $\mathbf{x}_q$, decoder $W_d$ predicts layer $\ell+1$.
  • Figure 3: Crosscoder: encoder $W_e$ maps concatenated multi-layer input $\mathbf{x}_p$ to $\mathbf{x}_q$, decoder $W_d$ reconstructs multi-layer output $\mathbf{x}_r$.
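The three figure captions describe the same encode-then-decode pattern with different choices of input $\mathbf{x}_p$ and target $\mathbf{x}_r$. The sketch below illustrates that shared structure in dependency-free Python. It is a minimal illustration, not the paper's implementation: the function names (`sdl_forward`, `sdl_loss`), the bias-free linear maps, the plain squared-error loss, the omission of any sparsity penalty, and the toy weights are all assumptions made for the example. Only the roles of $W_e$, $W_d$, $\mathbf{x}_p$, $\mathbf{x}_q$, $\mathbf{x}_r$ and the ReLU activation come from the text above.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sdl_forward(W_e, W_d, x_p):
    """Encode x_p into latents x_q = ReLU(W_e x_p), then decode with W_d.

    Assumption: no bias terms and no sparsity penalty; this only mirrors
    the encoder/decoder roles named in the figure captions.
    """
    x_q = relu(matvec(W_e, x_p))
    x_hat = matvec(W_d, x_q)
    return x_q, x_hat


def sdl_loss(W_e, W_d, x_p, x_r):
    """Squared error between the decoder output and the target x_r."""
    _, x_hat = sdl_forward(W_e, W_d, x_p)
    return sum((a - b) ** 2 for a, b in zip(x_hat, x_r))


# Hypothetical toy weights: 2-dimensional input, 3 latents (n_q = 3).
W_e = [[1, 0], [0, 1], [-1, 0]]
W_d = [[1, 0, -1], [0, 1, 0]]

# Sparse autoencoder: the target is the input itself (x_r = x_p).
x = [2.0, 3.0]
sae_loss = sdl_loss(W_e, W_d, x_p=x, x_r=x)

# Transcoder: input is layer l, target is layer l+1 (made-up values).
layer_l, layer_next = [2.0, 3.0], [2.0, 4.0]
transcoder_loss = sdl_loss(W_e, W_d, x_p=layer_l, x_r=layer_next)

# Crosscoder: input and target are concatenations of per-layer vectors
# (here two hypothetical 1-dimensional layers).
layers = [[1.0], [2.0]]
x_cat = [value for layer in layers for value in layer]
crosscoder_loss = sdl_loss(W_e, W_d, x_p=x_cat, x_r=x_cat)
```

In this sketch the three methods differ only in the arguments passed to `sdl_loss`, which is the sense in which they can be read as instances of one optimization problem.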

Theorems & Definitions (12)

  • Definition 2.1: Input Space
  • Definition 2.2: Model Representation
  • Definition 3.1: Sparse Dictionary Learning
  • Theorem 4.1: Approximate Global Minimum
  • Theorem 4.2: Necessary and Sufficient Conditions for Global Optimality
  • Definition 4.3: Approximate SDL Loss
  • Example 4.4: Spurious Local Minimum
  • Definition 4.5: Feature Absorption
  • Definition 4.6: Realizable Absorption Pattern
  • Theorem 4.7: Existence of Spurious Local Minima
  • ...and 2 more