On the Limits of Sparse Autoencoders: A Theoretical Framework and Reweighted Remedy

Jingyi Cui, Qi Zhang, Yifei Wang, Yisen Wang

TL;DR

This paper provides the first theoretical analysis of SAEs with a closed-form solution, showing that they generally fail to fully recover the ground truth monosemantic features unless those features are extremely sparse. It also proposes a reweighting strategy that targets reconstruction of the ground truth monosemantic features rather than the observed polysemantic ones.

Abstract

Sparse autoencoders (SAEs) have recently emerged as a powerful tool for interpreting the features learned by large language models (LLMs). By reconstructing features with sparsely activated networks, SAEs aim to disentangle complex superposed polysemantic features into interpretable monosemantic ones. Despite their wide application, it remains unclear under what conditions SAEs can fully recover the ground truth monosemantic features from the superposed polysemantic ones. In this paper, we provide the first theoretical analysis of SAEs with a closed-form solution, revealing that they generally fail to fully recover the ground truth monosemantic features unless those features are extremely sparse. To improve the feature recovery of SAEs in general cases, we propose a reweighting strategy that targets the reconstruction of the ground truth monosemantic features instead of the observed polysemantic ones. We further establish a theoretical weight selection principle for our proposed weighted SAE (WSAE). Experiments across multiple settings validate our theoretical findings and demonstrate that WSAE significantly improves feature monosemanticity and interpretability.
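To make the idea of reweighting a reconstruction objective concrete, here is a minimal sketch of a per-dimension weighted squared error. It is only an illustration: the function name `weighted_reconstruction_loss`, the `weights` vector, and the toy numbers are assumptions introduced here, and the abstract does not state the WSAE loss or its weight selection principle, so this should not be read as the paper's formula.

```python
import numpy as np

def weighted_reconstruction_loss(x_p, x_hat, weights):
    """Mean over samples of sum_i weights[i] * (x_hat[i] - x_p[i])**2.

    Hypothetical helper: `weights` is an assumed per-dimension weight
    vector, not the weighting rule derived in the paper.
    """
    x_p = np.asarray(x_p, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    weights = np.asarray(weights, dtype=float)
    sq_err = (x_hat - x_p) ** 2              # shape (num_samples, dim)
    return float(np.mean(np.sum(weights * sq_err, axis=1)))

# Toy inputs (made up for illustration): two samples, two dimensions.
x_p = np.array([[1.0, 2.0], [0.0, 1.0]])
x_hat = np.array([[0.0, 2.0], [0.0, 3.0]])

# Uniform weights give the ordinary squared reconstruction error.
uniform = weighted_reconstruction_loss(x_p, x_hat, np.ones(2))

# Non-uniform weights shift the emphasis between input dimensions.
reweighted = weighted_reconstruction_loss(x_p, x_hat, np.array([2.0, 0.5]))
```

With uniform weights the function reduces to the standard reconstruction error, which is why a weighted objective can be viewed as a generalization of the usual SAE loss.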

Paper Structure

This paper contains 27 sections, 5 theorems, 41 equations, 5 figures, 3 tables.

Key Result

Theorem 1

Let $\mathcal{L}_{\mathrm{SAE}}$ be the SAE loss (eq::SAE in the paper) with sparse activation function $\sigma$. If $n_m \geq n$ and the columns of $W_p$ within the superposed dimensions form digons/polygons, then we have $W_m^*=I^*(W_p,\boldsymbol{0})^\top \in \arg\min_{W_m}\mathcal{L}_{\mathrm{SAE}}(W_m; \ldots$

Figures (5)

  • Figure 1: Theoretical framework for sparse autoencoder (SAE) feature recovery. The superposed polysemantic features $x_p$, composed of ground truth monosemantic features $x$ with matrix $W_p$, serve as the input to the SAE. For the SAE, $W_m$ denotes the weight matrix, $\sigma$ denotes the sparse activation function, and $\mathcal{L}_{\mathrm{SAE}}$ denotes the reconstruction loss of $x_p$. Ideally, we expect the SAE output $x_m$ to fully recover the ground truth monosemantic features $x$ through reconstruction of $x_p$.
  • Figure 2: Monosemanticity of SAE features (measured by the average number of activated features) increases as the ground truth monosemantic features become sparser.
  • Figure 3: Validation experiments of WSAE ground truth reconstruction on synthetic data. (a) Ground truth reconstruction error $\mathcal{L}_{\mathrm{GT}}$, where WSAE has lower error than SAE when the sparsity level $S$ is low. (b) Reconstruction error on the non-sparse dimensions of the ground truth monosemantic features, showing a greater error gap between WSAE and SAE. (c) The reconstruction error of the polysemantic features $x_p$, where the errors of the two methods are comparable. (d) Monosemanticity measured by per-dimensional variance, where WSAE features are more monosemantic than SAE features when the sparsity level is low.
  • Figure 4: Semantic consistency (%) of SAEs trained on ResNet-18 embeddings with the original SAE loss and the weighted SAE loss.
  • Figure 5: (a) Semantic consistency of WSAEs under different $\alpha$. (b) Semantic consistency of SAE and WSAEs with different monosemanticity proxies, including semantic consistency and per-dimensional variance. (c) SAE reconstruction error of WSAEs under different $\alpha$.
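The setup described in the Figure 1 caption can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the sparsity model (each coordinate zero with probability `S`), the random unit-norm `W_p`, the choice of ReLU as the sparse activation $\sigma$, the tied-weight decoder, and all dimensions are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: ground-truth dim n, superposed dim n_p, SAE hidden dim n_m (n_m >= n).
n, n_p, n_m = 8, 4, 8
S = 0.9                 # assumed probability that a ground-truth coordinate is zero
num_samples = 1000

# Ground-truth monosemantic features x: nonnegative and sparse.
mask = rng.random((num_samples, n)) > S
x = mask * rng.random((num_samples, n))

# Superposition matrix W_p (n_p x n) with unit-norm columns; random, for illustration only.
W_p = rng.standard_normal((n_p, n))
W_p /= np.linalg.norm(W_p, axis=0, keepdims=True)

# Polysemantic inputs x_p = W_p x, stored row-wise with shape (num_samples, n_p).
x_p = x @ W_p.T

def sae_forward(W_m, x_p):
    """Assumed tied-weight SAE: encode with ReLU(W_m x_p), decode with W_m^T."""
    x_m = np.maximum(x_p @ W_m.T, 0.0)   # SAE features, shape (num_samples, n_m)
    x_hat = x_m @ W_m                    # reconstruction of x_p, shape (num_samples, n_p)
    return x_m, x_hat

def sae_loss(W_m, x_p):
    """Mean squared reconstruction error of x_p."""
    _, x_hat = sae_forward(W_m, x_p)
    return float(np.mean(np.sum((x_hat - x_p) ** 2, axis=1)))

# One candidate weight matrix, chosen for illustration: the transpose of W_p (n_m == n here).
W_m = W_p.T.copy()
x_m, x_hat = sae_forward(W_m, x_p)
loss = sae_loss(W_m, x_p)

# Average number of active SAE features per sample (lower suggests more monosemantic features).
avg_active = float(np.mean(np.sum(x_m > 0, axis=1)))
```

Varying `S` in this sketch changes how often features co-occur in `x`, which is the quantity the sparsity condition in the theoretical analysis concerns.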

Theorems & Definitions (12)

  • Theorem 1: Closed-Form Solution to SAEs
  • Example 1: Feature Shrinking
  • Example 2: Feature Vanishing
  • Theorem 2: Optimality under extreme sparsity
  • Theorem 3: Uniqueness
  • Theorem 4: Gap between $\mathcal{L}_{\mathrm{SAE}}$ and $\mathcal{L}_{\mathrm{GT}}$
  • Theorem 5: Gap between $\mathcal{L}_{\mathrm{WSAE}}$ and $\mathcal{L}_{\mathrm{GT}}$
  • Proof of Theorem 1
  • Proof of Theorem 2
  • Proof of Theorem 3
  • ...and 2 more