Enforcing Orderedness to Improve Feature Consistency

Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang

TL;DR

The paper addresses the poor reproducibility of sparse autoencoder (SAE) features by introducing Ordered Sparse Autoencoders (OSAE), which enforce a strict ordering of latent features and deterministically use every feature dimension. Building on Matryoshka SAEs, OSAE trains with a nested-prefix (Top-m) objective in the style of nested dropout, which induces a canonical feature ordering and targets exact ordered recovery under sparsity-uniqueness conditions. The theory shows that this ordering resolves permutation non-identifiability in overcomplete sparse dictionary learning when solutions are unique up to natural symmetries. Empirically, OSAE improves feature orderedness and early-feature stability on Gemma-2 2B and Pythia-70M across multiple seeds and datasets, though it can incur higher reconstruction loss in some settings. Overall, the work offers a principled identifiability mechanism for overcomplete representations that makes latent features more reproducible and comparable across runs and hyperparameters, which matters for interpretability of large-scale representations.

Abstract

Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.
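
The idea of a nested-prefix objective that deterministically uses every feature dimension can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `prefix_losses` and `ordered_loss`, the squared-error reconstruction loss, and the uniform default weighting over prefixes are all assumptions made for illustration, and the encoder and sparsity step are omitted.

```python
import numpy as np

def prefix_losses(x, z, D):
    """Reconstruction error when only the first k latents are used, for k = 1..K.

    x: (N, d) inputs, z: (N, K) sparse codes, D: (K, d) decoder atoms (one per row).
    Returns an array of K mean squared errors, one per prefix length.
    """
    K = D.shape[0]
    losses = np.empty(K)
    recon = np.zeros_like(x, dtype=float)
    for k in range(K):
        # Add the contribution of latent k, so recon equals z[:, :k+1] @ D[:k+1].
        recon = recon + np.outer(z[:, k], D[k])
        losses[k] = np.mean(np.sum((x - recon) ** 2, axis=1))
    return losses

def ordered_loss(x, z, D, weights=None):
    """Weighted sum of all prefix losses (uniform weights are an assumption).

    Every prefix length contributes on every call, in contrast to a
    nested-dropout scheme that samples a single prefix length per step.
    """
    losses = prefix_losses(x, z, D)
    K = len(losses)
    w = np.full(K, 1.0 / K) if weights is None else np.asarray(weights, dtype=float)
    return float(w @ losses)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = np.abs(rng.normal(size=(4, 3)))   # hypothetical non-negative codes
    D = rng.normal(size=(3, 5))           # hypothetical decoder
    x = z @ D                             # inputs that the full code reconstructs exactly
    print(prefix_losses(x, z, D))         # the last entry is numerically zero
    print(ordered_loss(x, z, D))
```

Because early latents appear in every prefix, they are pushed to carry the most reconstruction-relevant directions, which is the intuition behind an induced ordering.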

Paper Structure

This paper contains 32 sections, 2 theorems, 39 equations, 14 figures, 2 tables.

Key Result

Lemma 3.1

Any minimiser of $\mathcal{L}_{\mathrm{ND}}$ also minimises the full‐prefix loss $\mathcal{L}_k$. That is,

Figures (14)

  • Figure 1: Recovery across SAE variants on a Gaussian toy model with $(d,K,m,N)=(80,100,5,100\,000)$. Each panel plots the Hungarian matching between learned decoder atoms $D$ and ground truth $D^{*}$ (one dot per matched pair; color encodes cosine similarity). Ordered SAEs achieve higher stability $\mathrm{Stab}(D,D^{*})$ (mean matched cosine) and higher orderedness $\mathrm{Ord}(D,D^{*})$ (order agreement), meaning they recover features more faithfully and in order.
  • Figure 2: SAEs trained on Gemma2-2B. (a) Orderedness evaluated at different prefix lengths. O-SAEs have the most consistently ordered features, almost reaching an average $\mathrm{Ord}(D,D')$ of 0.8. As expected, orderedness is close to 0.0 for the first 128 features of the Fixed MSAE, since its first group has size 128; beyond that it jumps to values around 0.5. (b) O-SAEs have high stability for the first portion of features, followed by a sharp decline for later features. $\binom{9}{2}=36$ pairs of seeds are evaluated per method, and $95\%$ confidence intervals are shown.
  • Figure 3: SAEs trained on Pythia-70M. (a) O-SAEs improve orderedness over Random MSAEs and Fixed MSAEs on the Pile and Dolma after prefix length 128. O-SAE stability is likewise higher than Random MSAE stability on both datasets for the earliest features, but the two cross over at around prefix length 128. Fixed MSAEs have higher stability than O-SAEs but lower orderedness (n=10). (b) In the cross-dataset, cross-seed setting, O-SAEs show modest orderedness improvements and sizable stability gains over Random MSAEs before prefix length 128. Both O-SAEs and Random MSAEs improve in orderedness and stability when the same seed is used despite training on different datasets; however, the improvements are larger for O-SAEs (cross-seed: n=20; same-seed: n=5).
  • Figure 4: Recovery across SAE variants on a Gaussian toy model with $(d,K,m,N)=(80,100,5,100\,000)$. Each panel plots the Hungarian matching between learned decoder atoms $D$ and ground truth $D^{*}$ (one dot per matched pair; color encodes cosine similarity). Ordered SAEs achieve higher stability $\mathrm{Stab}(D,D^{*})$ (mean matched cosine) and higher orderedness $\mathrm{Ord}(D,D^{*})$ (order agreement), meaning they recover features more faithfully and in order.
  • Figure 5: Vanilla SAE (example seed pair). Top: activation rasters for 50 eval inputs (left: $Y^*$, middle: $Z^{(0)}$, right: $Z^{(1)}$). Middle: all-pairs activation–Pearson matrices (left: $Z^{(0)}$ vs. $Y^*$, middle: $Z^{(1)}$ vs. $Y^*$, right: $Z^{(0)}$ vs. $Z^{(1)}$) used for $\mathrm{Stab}_Z$ and $\mathrm{Ord}_Z$. Bottom: all-pairs decoder–cosine matrices (left: $D^{(0)}$ vs. $D^*$, middle: $D^{(1)}$ vs. $D^*$, right: $D^{(0)}$ vs. $D^{(1)}$); this extends the toy-model recovery figure from matched pairs to all pairs.
  • ...and 9 more figures
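
The captions above describe stability as the mean cosine similarity of Hungarian-matched decoder atoms and orderedness as order agreement between matched atoms. A minimal sketch of how such metrics could be computed follows. The captions do not give the exact definition of $\mathrm{Ord}$, so the Spearman-style rank correlation between matched indices used here is an assumption, as are the function names and the restriction to dictionaries of equal size.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_matrix(D_a, D_b):
    """All-pairs cosine similarity between rows of D_a (K, d) and D_b (K, d)."""
    A = D_a / np.linalg.norm(D_a, axis=1, keepdims=True)
    B = D_b / np.linalg.norm(D_b, axis=1, keepdims=True)
    return A @ B.T

def stability_and_orderedness(D_a, D_b):
    """Return (stab, ord) for two dictionaries with the same number of atoms.

    stab: mean cosine similarity over the Hungarian matching.
    ord:  correlation between each atom's index and its match's index
          (assumed definition; equals Spearman correlation because both
          index vectors are permutations without ties).
    """
    C = cosine_matrix(D_a, D_b)
    # Hungarian matching that maximises total cosine similarity.
    rows, cols = linear_sum_assignment(C, maximize=True)
    stab = float(C[rows, cols].mean())
    ord_ = float(np.corrcoef(rows, cols)[0, 1])
    return stab, ord_

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D = rng.normal(size=(6, 8))                     # hypothetical reference atoms
    print(stability_and_orderedness(D, D))          # same atoms, same order
    print(stability_and_orderedness(D, D[::-1]))    # same atoms, reversed order
```

In this toy usage both calls recover every atom, so stability is essentially 1 in each case, while the orderedness value separates the identically ordered dictionary from the reversed one.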

Theorems & Definitions (4)

  • Lemma 3.1
  • Theorem 3.1
  • proof
  • proof: Proof of Lemma 3.1