Stable and Steerable Sparse Autoencoders with Weight Regularization

Piotr Jedryszek, Oliver M. Crook

TL;DR

On MNIST, L2 weight regularization produces a core of features whose encoder and decoder directions are highly aligned. Combined with tied initialization and unit-norm decoder constraints, it dramatically increases cross-seed feature consistency.

Abstract

Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we study weight regularization by adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increases the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean automated interpretability score essentially unchanged. Finally, in the regularized setting, activation steering success is better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability.
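
The ingredients named in the abstract (TopK sparsity, tied initialization, a unit-norm decoder, and an L2 penalty on encoder and decoder weights) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the omission of bias terms, the mean-squared reconstruction loss, and the single shared coefficient `l2_coeff` are all hypothetical choices made here for clarity.

```python
import numpy as np

def init_tied(d_in, n_latents, seed=0):
    """Tied initialization (assumed form): encoder starts as the decoder transpose."""
    rng = np.random.default_rng(seed)
    W_dec = rng.standard_normal((n_latents, d_in))
    # Unit-norm decoder constraint: each latent's decoder row has norm 1.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    W_enc = W_dec.T.copy()
    return W_enc, W_dec

def topk_activations(pre, k):
    """Keep the k largest pre-activations per sample, zero the rest, then apply ReLU."""
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return np.maximum(out, 0.0)

def loss(x, W_enc, W_dec, k, l2_coeff):
    """Reconstruction error plus an L2 penalty on both weight matrices."""
    z = topk_activations(x @ W_enc, k)       # sparse latent code
    x_hat = z @ W_dec                        # reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    penalty = l2_coeff * (np.sum(W_enc ** 2) + np.sum(W_dec ** 2))
    return recon + penalty, recon, penalty
```

In an actual training loop the decoder rows would be renormalized after each gradient step to maintain the unit-norm constraint; that step is omitted from this sketch.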

Paper Structure (25 sections, 3 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: MNIST cosine similarity and feature visualizations. Left column (A--C): Histograms of encoder--decoder cosine similarity for base, L1, and L2 SAEs (1,568 latents). L2 creates a bimodal distribution with a small high-alignment core. Right panels (D--I): Feature pairs showing encoder and decoder weights side by side as 28$\times$28 heatmaps (blue = negative, red = positive). Left subcolumns show random samples; right subcolumns show high cosine-similarity features. L2's high-alignment features capture clean strokes and curves, while the base (no regularization) features appear noisy. The plotted features come from the SAEs without decoder norm constraints.
  • Figure 2: MNIST shared vs. random feature visualization. Encoder weights as 28$\times$28 heatmaps (blue = negative, red = positive). Top two rows show random features; bottom two rows show features classified as shared between the SAEs trained with random seeds 0 and 2. Each feature is scaled to $\pm\max(|\text{weights}|)$. Shared features capture clean strokes and curves; random features appear noisy. The features shown are from the SAE with tied weights and a constrained decoder but no weight penalty.
  • Figure 3: Encoder--decoder cosine similarity distributions for regularized ("reg") vs. unregularized ("no reg") SAEs across architectures. Dead features (encoder or decoder norm $= 0$) are excluded; for TopK-L2, the majority of features fall into this category (see Appendix \ref{app:topk_dead}). The distributions plotted are those of the Pareto-best SAE for each architecture and of the unregularized SAE with the corresponding sparsity penalty k.
  • Figure 4: Pythia-70M TopK: cross-seed feature consistency metrics across sparsity levels ($k$) for unregularized (blue) and L2-regularized (orange) SAEs. L2 weight regularization substantially increases sharedness, particularly among alive features (orange bars).
  • Figure 5: Pythia-70M TopK $k{=}40$: (A) Steering success rate (LLM judge score $\ge 4$); L2 regularization roughly doubles the rate. (B) Auto-interpretability score distributions remain similar across conditions. (C) Spearman correlation between auto-interpretability and steering success; regularization strengthens the link.
  • ...and 4 more figures
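
The encoder--decoder cosine similarity plotted in Figures 1 and 3 can be computed per latent as sketched below. This is an illustrative sketch rather than the paper's code: the function name and the weight layout (encoder of shape `(d_in, n_latents)`, decoder of shape `(n_latents, d_in)`) are assumptions. Following the Figure 3 caption, latents whose encoder or decoder norm is zero are treated as dead and excluded.

```python
import numpy as np

def encoder_decoder_cosine(W_enc, W_dec):
    """Cosine similarity between each latent's encoder and decoder direction.

    W_enc: (d_in, n_latents), W_dec: (n_latents, d_in)  (assumed layout).
    Returns the similarities for alive latents and the boolean alive mask.
    """
    enc = W_enc.T                                  # one row per latent
    enc_norm = np.linalg.norm(enc, axis=1)
    dec_norm = np.linalg.norm(W_dec, axis=1)
    # Dead features: encoder or decoder norm equal to zero.
    alive = (enc_norm > 0) & (dec_norm > 0)
    dots = np.sum(enc[alive] * W_dec[alive], axis=1)
    cos = dots / (enc_norm[alive] * dec_norm[alive])
    return cos, alive
```

A histogram of the returned `cos` values (for example with `np.histogram(cos, bins=50, range=(-1, 1))`) gives the kind of distribution shown in those figures.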