Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

David Chanin, Adrià Garriga-Alonso

TL;DR

This work investigates how the sparsity hyperparameter $L_0$ in Sparse Autoencoders affects the disentanglement of learned features from LLM activations, showing that both too-low and too-high $L_0$ lead to feature mixing and degraded interpretability. It introduces decoder-based metrics, notably $c_{dec}$ and the related $s_n^{dec}$, to detect mis-specified sparsity and guide $L_0$ selection, and validates these diagnostics on toy models and on LLMs (Gemma-2-2b and Llama-3.2-1b). The experiments reveal that peak sparse-probing performance often coincides with the elbow near the true $L_0$ rather than with optimal reconstruction, and that many open-source SAEs operate at $L_0$ values that are too low. The findings challenge the use of sparsity-reconstruction tradeoffs as the sole evaluation criterion and offer practical, metric-driven guidance for setting $L_0$ to obtain interpretable, monosemantic features in LLM contexts.

Abstract

Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.
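As a concrete reading of the definition above, L0 can be measured as the mean number of non-zero SAE latents per token. The following is a minimal sketch, assuming the SAE latent activations are available as a NumPy array of shape (tokens, latents); the function name `mean_l0`, the threshold `eps`, and the example activations are hypothetical and not taken from the paper.

```python
import numpy as np


def mean_l0(feature_acts: np.ndarray, eps: float = 0.0) -> float:
    """Average number of active SAE latents per token.

    feature_acts: array of shape (n_tokens, n_latents) holding latent activations.
    eps: activations with absolute value <= eps count as inactive (assumed threshold).
    """
    active = np.abs(feature_acts) > eps      # boolean mask of firing latents
    return float(active.sum(axis=1).mean())  # count per token, then average


# Hypothetical activations: 4 tokens, 6 latents.
acts = np.array([
    [0.0, 1.2, 0.0, 0.0, 0.3, 0.0],  # 2 active latents
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],  # 1 active latent
    [0.0, 0.0, 2.1, 0.7, 0.0, 0.9],  # 3 active latents
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.4],  # 2 active latents
])

l0 = mean_l0(acts)  # (2 + 1 + 3 + 2) / 4 = 2.0
```

This only measures the L0 of an already-trained SAE on a batch of tokens; how L0 is enforced during training depends on the SAE architecture and is not covered by this sketch.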

Paper Structure

This paper contains 58 sections, 3 theorems, 53 equations, 29 figures.

Key Result

Theorem 1

Consider a toy model with two orthonormal features $\mathbf{f}_1, \mathbf{f}_2 \in \mathbb{R}^d$ where $\mathbf{f}_1 \cdot \mathbf{f}_2 = 0$ and $\|\mathbf{f}_1\|_2 = \|\mathbf{f}_2\|_2 = 1$. Let $\mathbf{f}_1$ fire alone with probability $P_1$, $\mathbf{f}_2$ fire alone with probability $P_2$, and

Figures (29)

  • Figure 1: When SAE L0 is too low (left) or too high (right), the SAE mixes together correlated features, ruining monosemanticity. Only at the correct L0 (middle) does the SAE learn correct features.
  • Figure 2: (left) Toy model feature correlation matrix showing positive correlations between features. (middle) SAE decoder cosine similarities with true features when SAE L0 = 2, matching the true L0 of the toy model. (right) SAE decoder cosine similarities with true features when SAE L0 = 1.8. When L0 is too low, the SAE mixes components of features based on their firing correlations.
  • Figure 3: (left) Toy model feature correlation matrix showing negative correlations between features. (middle) SAE decoder cosine similarities with true features when SAE L0 = 2, matching the true L0 of the toy model. (right) SAE decoder cosine similarities with true features when SAE L0 = 1.8. When L0 is too low, the SAE mixes negative components of anti-correlated features.
  • Figure 4: Sparsity (L0, lower is better) vs reconstruction (variance explained, higher is better) for learned SAEs and a ground-truth SAE. When L0 is less than the true L0 of the toy model (the dotted line), the trained SAE gets better reconstruction than the ground-truth SAE. Sparsity-reconstruction plots like this lead us to the incorrect conclusion that the ground-truth SAE is a worse SAE.
  • Figure 5: SAE decoder cosine similarity with true features for the SAEs from Figure 4 with L0=1 (left) and L0=5 (middle), compared with the ground-truth SAE (right). The trained SAEs score much better than the ground-truth SAE on variance explained, despite their corrupted, polysemantic latents.
  • ...and 24 more figures
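
Several of the captions above plot SAE decoder cosine similarities against the true features of a toy model. The sketch below shows one way such a matrix could be computed, assuming decoder directions and true features are stored as rows of NumPy arrays. The function name `decoder_feature_cosine` and the example matrices are hypothetical illustrations, not the authors' code or data.

```python
import numpy as np


def decoder_feature_cosine(decoder: np.ndarray, true_features: np.ndarray) -> np.ndarray:
    """Cosine similarity between each SAE decoder row and each true feature.

    decoder: (n_latents, d) decoder directions, one per row.
    true_features: (n_features, d) ground-truth feature directions, one per row.
    Returns an (n_latents, n_features) matrix of cosine similarities.
    """
    dec = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    feats = true_features / np.linalg.norm(true_features, axis=1, keepdims=True)
    return dec @ feats.T


# Hypothetical toy setup: 3 orthonormal true features in 3 dimensions.
true_features = np.eye(3)

# A "clean" decoder recovers each feature exactly.
clean_decoder = np.eye(3)

# A "mixed" decoder whose first latent absorbs part of feature 2,
# illustrating the kind of mixing the captions describe.
mixed_decoder = np.array([
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

clean_sim = decoder_feature_cosine(clean_decoder, true_features)  # identity matrix
mixed_sim = decoder_feature_cosine(mixed_decoder, true_features)  # off-diagonal entry in row 0
```

In a plot of this matrix, a clean SAE shows a single bright cell per latent, while off-diagonal mass such as `mixed_sim[0, 1]` indicates a latent that blends more than one true feature.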

Theorems & Definitions (9)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Remark 1
  • Theorem 3
  • proof
  • Remark 2
  • Remark 3