Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

David Chanin; Adrià Garriga-Alonso

TL;DR

This work investigates how the sparsity hyperparameter $L_0$ in Sparse Autoencoders (SAEs) affects the disentanglement of features learned from LLM activations, showing that both too-low and too-high $L_0$ lead to feature mixing and degraded interpretability. It introduces decoder-based metrics, notably $c_{dec}$ and the related $s_n^{dec}$, to detect mis-specified sparsity and guide $L_0$ selection, and validates these diagnostics on toy models and on LLMs (Gemma-2-2b and Llama-3.2-1b). The experiments show that peak sparse-probing performance often coincides with the elbow near the true $L_0$ rather than with optimal reconstruction, and that many open-source SAEs operate at $L_0$ values that are too low. The findings challenge the use of sparsity-reconstruction tradeoffs as the sole evaluation criterion and offer practical, metric-driven guidance for setting $L_0$ to obtain interpretable, monosemantic features.
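As a rough illustration of what a decoder-based diagnostic can look like, the sketch below computes the mean absolute pairwise cosine similarity between decoder directions. The summary above does not define $c_{dec}$, so reading it as this quantity is an assumption; the function name and the array layout (one decoder direction per row) are hypothetical as well.

```python
import numpy as np


def mean_abs_pairwise_cosine(decoder: np.ndarray) -> float:
    """Mean absolute cosine similarity over all distinct pairs of decoder rows.

    decoder: array of shape (n_latents, d_model), one direction per row.
    This layout is an assumption made for the sketch, not taken from the paper.
    """
    # Normalize each row to unit length so dot products become cosines.
    unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    cos = unit @ unit.T
    # Exclude the diagonal, which is each row's similarity with itself.
    off_diag = ~np.eye(decoder.shape[0], dtype=bool)
    return float(np.abs(cos[off_diag]).mean())


# Orthogonal directions do not overlap at all; identical directions overlap fully.
orthogonal = np.eye(3)
duplicated = np.array([[1.0, 0.0], [1.0, 0.0]])
print(mean_abs_pairwise_cosine(orthogonal))
print(mean_abs_pairwise_cosine(duplicated))
```

Under this reading, a higher value means decoder directions overlap more, which is the kind of signal one might track while sweeping $L_0$.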

Abstract

Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.
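To make the definition of L0 concrete, here is a minimal sketch that measures it from a batch of SAE latent activations: count the nonzero latents for each token, then average over tokens. The function name, the array layout (tokens by latents), and the toy values are illustrative assumptions, not code from the paper.

```python
import numpy as np


def mean_l0(latent_acts: np.ndarray) -> float:
    """Average number of active (nonzero) SAE latents per token.

    latent_acts: array of shape (n_tokens, n_latents) holding the SAE latent
    activations after the activation function (hypothetical input layout).
    """
    active = latent_acts != 0             # True where a latent fires
    per_token = active.sum(axis=1)        # number of firing latents per token
    return float(per_token.mean())


# Toy batch: three tokens, four latents; 2, 1, and 3 latents fire respectively.
acts = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 2.0],
    [0.5, 0.1, 0.7, 0.0],
])
print(mean_l0(acts))  # 2.0
```

In practice the same count would be averaged over many tokens drawn from the training distribution, since the abstract ties the correct L0 to a given training distribution.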
