Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

Charles O'Neill, Alim Gumran, David Klindt

TL;DR

The paper establishes a theoretical amortisation gap for sparse autoencoders (SAEs): when sparse sources are linearly projected into fewer dimensions ($M < N$) under the restricted isometry property, a one-pass linear-nonlinear SAE encoder cannot recover all sparse codes optimally. By decoupling encoding from decoding, it empirically compares SAEs, MLPs, and traditional sparse coding on synthetic data and GPT-2 activations, and finds that more expressive encoders (notably sparse coding and sometimes MLPs) yield better sparse inference and interpretability at higher compute cost. The findings generalise to large language model activations, suggesting that richer encoders can improve interpretability without sacrificing validity, at the price of more computation and a more complex optimisation. The work clarifies when and why amortised inference in SAEs may be suboptimal and points toward practical strategies for improved feature extraction in neural network interpretability.

Abstract

A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. Using compressed sensing theory, we prove that an SAE encoder is inherently insufficient for accurate sparse inference, even in solvable cases. We then decouple encoding and decoding processes to empirically explore conditions where more sophisticated sparse inference methods outperform traditional SAE encoders. Our results reveal substantial performance gains with minimal compute increases in correct inference of sparse codes. We demonstrate this generalises to SAEs applied to large language models, where more expressive encoders achieve greater interpretability. This work opens new avenues for understanding neural network representations and analysing large language model activations.
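To make the contrast between one-pass (amortised) encoding and iterative sparse inference concrete, here is a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper: ISTA stands in for "traditional sparse coding", the dictionary `D` is known and random, and the one-pass encoder uses untrained tied weights (`W = D.T`, zero bias). The function names and the values of `lam` and `n_steps` are illustrative only, and the sketch does not reproduce any of the paper's experiments.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 norm: shrink every entry towards zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam=0.01, n_steps=200):
    # Iterative sparse inference (ISTA) for
    #   min_s 0.5 * ||D s - x||^2 + lam * ||s||_1
    # with a known dictionary D of shape (M, N).
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / (largest singular value)^2
    s = np.zeros(D.shape[1])
    for _ in range(n_steps):
        grad = D.T @ (D @ s - x)
        s = soft_threshold(s - step * grad, step * lam)
    return s

def one_pass_encoder(x, W, b):
    # Linear-nonlinear (SAE-style) encoder: a single affine map followed by ReLU.
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
N, M, K = 16, 8, 2                        # sources, observed dims, sparsity (assumed values)
D = rng.standard_normal((M, N))
D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm dictionary columns

s_true = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
s_true[support] = rng.uniform(0.5, 1.5, size=K)
x = D @ s_true                            # observation in M < N dimensions

s_ista = ista(x, D)                       # many cheap iterations per input
s_amortised = one_pass_encoder(x, D.T, np.zeros(N))  # one matrix multiply per input

print("ISTA reconstruction error:    ", np.linalg.norm(D @ s_ista - x))
print("one-pass reconstruction error:", np.linalg.norm(D @ s_amortised - x))
```

The iterative solver spends `n_steps` gradient steps on each input, whereas the one-pass encoder spends a single matrix multiply; the compute-versus-accuracy trade-off described above is between these two styles of inference.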

Paper Structure

This paper contains 47 sections, 2 theorems, 24 equations, 18 figures, 1 table.

Key Result

Theorem 3.1

Let $S=\mathbb{R}^N$ be $N$ sources following a sparse distribution $P_S$ such that any sample has at most $K \geq 2$ non-zero entries, i.e., $\|s\|_0 \leq K, \forall s \in \text{supp}(P_S)$, where $\text{supp}(P_S)$ forms a union of $K$-dimensional subspaces. The sources are linearly projected into

Figures (18)

  • Figure 1: Illustration of SAE Amortisation Gap. Left: sparse sources in an $N=3$ dimensional space with at most $\|s\|_0 \leq K = 2$ non-zero entries. Both blue and red points are valid sources; by contrast, the top right corner $s=(1, 1, 1)$ is not. Middle: the sources as they are linearly decoded into observation space. In most applications, this is the activation space of a neural network that we are trying to lift out of superposition. Right: using a linear-nonlinear encoder, an SAE is tasked with projecting the points back onto their correct positions. This is not possible, because the pre-activations are at most $M=2$ dimensional (see the proof in the appendix).
  • Figure 2: Performance comparison of SAE and MLPs in predicting known latent representations. The black dashed line in (b) indicates the average FLOPs at which MLPs surpass SAE performance.
  • Figure 3: Performance comparison of SAE, SAE with inference-time optimisation (SAE+ITO), and MLPs in predicting latent representations with a known dictionary. Dashed lines in (b) indicate extrapolated performance beyond the measured range.
  • Figure 4: Dictionary learning performance comparison when both $s^*$ and $D^*$ are unknown.
  • Figure 5: Difference in final latent MCC between methods across varying $N$ and $M$, for $K=3$ and $K=9$. Left: Sparse coding vs. SAE. Right: MLP vs. SAE. The black dashed line indicates the theoretical recovery boundary.
  • ...and 13 more figures
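
The dimensionality argument in the Figure 1 caption can be checked numerically. The sketch below assumes a random decoder `D` and random encoder weights `W` and bias `b` (none of which come from the paper) and uses the caption's toy sizes $N=3$, $M=2$, $K=2$. It shows that the linear part of the encoder pre-activations has rank at most $M$, even though the valid sparse sources span all $N$ dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 2                                # toy sizes from the Figure 1 caption

# All binary sources with at most K = 2 non-zero entries (excluding the origin).
sources = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 1, 0], [1, 0, 1], [0, 1, 1],
], dtype=float)

D = rng.standard_normal((M, N))            # hypothetical decoder: sources -> observations
W = rng.standard_normal((N, M))            # hypothetical encoder weights
b = rng.standard_normal(N)                 # hypothetical encoder bias

X = sources @ D.T                          # observations, shape (6, M)
linear_part = X @ W.T                      # encoder output before the bias, shape (6, N)
pre_activations = linear_part + b          # what the nonlinearity actually sees

source_rank = np.linalg.matrix_rank(sources, tol=1e-9)
pre_act_rank = np.linalg.matrix_rank(linear_part, tol=1e-9)

print("rank of sparse sources:        ", source_rank)
print("rank of linear pre-activations:", pre_act_rank)
```

Adding the bias only shifts the pre-activations, so they stay inside an affine subspace of dimension at most $M$, whatever values `W` and `b` take.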

Theorems & Definitions (3)

  • Theorem 3.1: SAE Amortisation Gap
  • Theorem 1.1: SAE Amortisation Gap
  • proof