Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

TL;DR

This work advances mechanistic interpretability of large language models by improving sparse dictionary learning with Gated Sparse Autoencoders (Gated SAEs). By decoupling feature detection from magnitude estimation, and sharing weights between the gating and magnitude paths, the Gated SAE mitigates L1-induced shrinkage and yields sparser, higher-fidelity reconstructions with interpretability comparable to baseline methods. Across GELU-1L, Pythia-2.8B, and Gemma-7B, Gated SAEs achieve a Pareto improvement in reconstruction quality at a given sparsity and overcome shrinkage, as measured by the relative reconstruction bias metric. Ablation studies indicate that weight tying, decoder freezing, and the gating mechanism are each important, while human interpretability assessments indicate that the learned features are at least as interpretable as those of baseline SAEs. These results advance dictionary learning for mechanistic interpretability, enabling more efficient and reliable discovery of interpretable directions in LM activations.

Abstract

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
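
The separation described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function and parameter names (`gated_encode`, `W_gate`, `r_mag`, `b_dec`, and so on) are hypothetical, and both the input centering by `b_dec` and the per-feature `exp(r_mag)` rescaling that ties the magnitude weights to the gating weights are assumptions about details this summary does not spell out.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_encode(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    """Sketch of a gated encoder with shared weights (assumed scheme).

    Returns (features, gate_acts):
      features  - gate * magnitude for each dictionary feature
      gate_acts - ReLU of the gating pre-activation, the only quantity
                  an L1 sparsity penalty would touch in this sketch
    """
    x_centered = [xi - bi for xi, bi in zip(x, b_dec)]
    shared = matvec(W_gate, x_centered)  # computed once, reused by both paths
    features, gate_acts = [], []
    for i, s in enumerate(shared):
        pre_gate = s + b_gate[i]                     # (a) which features fire
        pre_mag = math.exp(r_mag[i]) * s + b_mag[i]  # (b) how strongly they fire
        gate = 1.0 if pre_gate > 0 else 0.0
        features.append(gate * max(pre_mag, 0.0))
        gate_acts.append(max(pre_gate, 0.0))
    return features, gate_acts


# Toy input: two features, identity gating weights, zero decoder bias.
f, gate_acts = gated_encode(
    x=[1.0, 0.0],
    W_gate=[[1.0, 0.0], [0.0, 1.0]],
    b_gate=[-0.5, -0.5],
    r_mag=[0.0, 0.0],
    b_mag=[0.0, 0.0],
    b_dec=[0.0, 0.0],
)
print(f, gate_acts)
```

In this toy setup only the first feature passes its gate. Its reported magnitude comes from the magnitude path, while a sparsity penalty on `gate_acts` would act on the gating path alone.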

Paper Structure (39 sections, 16 equations, 21 figures, 10 tables)

Figures (21)

  • Figure 1: The performance of Gated SAEs compared to the baseline SAE at Layer 20 in Gemma-7B (log-scale axes from L0=2 to L0=200). The SAEs are trained with equal compute, since the baseline SAEs have 50% more learned features (\ref{subsec:benchmarking}). This performance improvement holds in layers throughout GELU-1L, Pythia-2.8B and Gemma-7B (\ref{app:more_paretos}). Full detail in \ref{table:gemma_7b_baselines2} and \ref{table:gemma_7b_gated2}.
  • Figure 2: The L1 penalty in sparse autoencoders causes shrinkage -- reconstructions are biased towards smaller norms, even when perfect reconstruction is possible. For example, a single-feature SAE (with L1 coefficient $\lambda=1$) reconstructs 1/2 rather than 1 when minimizing \ref{eqn:sae_loss}.
  • Figure 3: The Gated SAE architecture with weight sharing between the gating and magnitude paths, shown with an example input.
  • Figure 4: After applying the weight sharing scheme of \ref{eq:tying_scheme}, a gated encoder becomes equivalent to a single-layer linear encoder with a Jump ReLU \cite{erichson2019jumprelu} activation function $\sigma_\theta$, illustrated above.
  • Figure 5: Gated SAEs offer better reconstruction fidelity (as measured by loss recovered) at any given level of feature sparsity (as measured by L0). This plot compares Gated and baseline SAEs trained on GELU-1L neuron activations; see \ref{app:more_paretos} for comparisons on Pythia-2.8B and Gemma-7B.
  • ...and 16 more figures
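
The shrinkage example in the Figure 2 caption can be checked numerically. The sketch below assumes the single-feature loss takes the form $(x - f)^2 + \lambda |f|$ with a unit decoder direction and input $x = 1$. The loss equation the caption refers to is not reproduced in this summary, so that form, along with the names `sae_loss` and `argmin_on_grid`, is an illustrative assumption. Under it, setting the derivative to zero for $f > 0$ gives $f = x - \lambda/2$, which is 1/2 for $\lambda = 1$.

```python
def sae_loss(f, x=1.0, lam=1.0):
    """Assumed single-feature loss: squared error plus an L1 penalty."""
    return (x - f) ** 2 + lam * abs(f)


def argmin_on_grid(loss, lo=0.0, hi=1.5, steps=1500):
    """Brute-force minimizer over an evenly spaced grid on [lo, hi]."""
    best_f, best_val = lo, loss(lo)
    for k in range(1, steps + 1):
        f = lo + (hi - lo) * k / steps
        val = loss(f)
        if val < best_val:
            best_f, best_val = f, val
    return best_f


f_star = argmin_on_grid(sae_loss)        # numerical minimizer
closed_form = max(1.0 - 1.0 / 2.0, 0.0)  # x - lam / 2, clipped at zero
f_no_penalty = argmin_on_grid(lambda f: sae_loss(f, lam=0.0))

print(f_star, closed_form, f_no_penalty)
```

With the penalty switched off (`lam=0.0`) the same search recovers the input exactly, so in this sketch the underestimate comes from the L1 term alone rather than from any limit on what the autoencoder can represent.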