Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Kola Ayonrinde

TL;DR

Two novel SAE variants are proposed, Feature Choice SAEs and Mutual Choice SAEs, each of which allows a variable number of active features per token, together with a new auxiliary loss function, $\mathtt{aux\_zipf\_loss}$, which generalises $\mathtt{aux\_k\_loss}$ to mitigate dead and underutilised features.

Abstract

Sparse autoencoders (SAEs) are a promising approach to extracting features from neural networks, enabling model interpretability as well as causal interventions on model internals. SAEs generate sparse feature representations using a sparsifying activation function that implicitly defines a set of token-feature matches. We frame the token-feature matching as a resource allocation problem constrained by a total sparsity upper bound. For example, TopK SAEs solve this allocation problem with the additional constraint that each token matches with at most $k$ features. In TopK SAEs, the $k$ active features per token constraint is the same across tokens, despite some tokens being more difficult to reconstruct than others. To address this limitation, we propose two novel SAE variants, Feature Choice SAEs and Mutual Choice SAEs, which each allow for a variable number of active features per token. Feature Choice SAEs solve the sparsity allocation problem under the additional constraint that each feature matches with at most $m$ tokens. Mutual Choice SAEs solve the unrestricted allocation problem where the total sparsity budget can be allocated freely between tokens and features. Additionally, we introduce a new auxiliary loss function, $\mathtt{aux\_zipf\_loss}$, which generalises the $\mathtt{aux\_k\_loss}$ to mitigate dead and underutilised features. Our methods result in SAEs with fewer dead features and improved reconstruction loss at equivalent sparsity levels as a result of the inherent adaptive computation. More accurate and scalable feature extraction methods provide a path towards better understanding and more precise control of foundation models.
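
The three allocation rules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the toy affinity matrix `Z`, and the choice to rank by signed affinity value (with no subsequent ReLU) are assumptions made for illustration only. Each function returns a boolean mask of token-feature matches, and all three use the same total sparsity budget of 6.

```python
import numpy as np

def token_choice_mask(z: np.ndarray, k: int) -> np.ndarray:
    """TopK / Token Choice: each token (row) matches its k highest-affinity features."""
    top = np.argsort(-z, axis=1)[:, :k]
    mask = np.zeros(z.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask

def feature_choice_mask(z: np.ndarray, m: int) -> np.ndarray:
    """Feature Choice: each feature (column) matches its m highest-affinity tokens."""
    return token_choice_mask(z.T, m).T

def mutual_choice_mask(z: np.ndarray, budget: int) -> np.ndarray:
    """Mutual Choice: the `budget` highest affinities match, wherever they occur."""
    top = np.argsort(-z, axis=None)[:budget]  # indices into the flattened matrix
    mask = np.zeros(z.size, dtype=bool)
    mask[top] = True
    return mask.reshape(z.shape)

# Hypothetical pre-activation affinities: 3 tokens (rows) x 6 features (columns).
Z = np.array([
    [0.90, 0.10,  0.50, -0.30,  0.65, 0.05],
    [0.60, 0.80,  0.40,  0.70, -0.20, 0.15],
    [0.20, 0.35, -0.10,  0.25,  0.45, 0.55],
])

masks = {
    "token_choice": token_choice_mask(Z, k=2),          # 3 tokens x 2 = 6 matches
    "feature_choice": feature_choice_mask(Z, m=1),      # 6 features x 1 = 6 matches
    "mutual_choice": mutual_choice_mask(Z, budget=6),   # 6 matches overall
}

for name, mask in masks.items():
    print(name, "per-token:", mask.sum(axis=1), "per-feature:", mask.sum(axis=0))
```

On this toy matrix, Token Choice gives every token exactly two features but leaves one feature unused, whereas Feature Choice uses every feature once and gives the three tokens 3, 2 and 1 active features respectively. Mutual Choice spends the same budget with no per-token or per-feature cap.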

Paper Structure

This paper contains 29 sections, 3 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: A comparison of the pre-activation affinities and the resulting feature activations following different sparsifying activation functions. Red and blue represent positive and negative affinities respectively, with deeper colours representing larger magnitudes. In the first three approaches we have a total sparsity budget of 6. Affinities (Far-left): The token-feature affinities $\textbf{Z'}$, before any sparsifying activation function. Token Choice/TopK (Center-left): We activate the top $k$ features corresponding to each token. Note that some features do not fire in this batch, which could lead to dead features. Feature Choice (Center): Each feature activates on the $m$ tokens with which it has the highest affinity. Note that all features fire in this batch. Mutual Choice (Center-right): The elements with the largest magnitude affinities activate, regardless of their token or feature affiliations. ReLU/Standard (Far-right): All strictly positive elements activate. Here we allow low-magnitude feature activations, which may be false positives and which cause the $L_0$ to be higher. Jump-ReLU SAEs can be seen as a special case of ReLU SAEs where the activation threshold is non-zero.
  • Figure 2: $\mathcal{G}$ (left) is a weighted bipartite graph $\mathcal{G} = \{\{T_1, T_2, T_3\} \times \{F_1, F_2, ..., F_6 \}, \textbf{E}\}$. Edge weights represent pre-activation affinities with red and blue representing positive and negative values respectively. We are seeking a subgraph $\mathcal{H} \subseteq \mathcal{G}$ with $M=6$ edges. Here we have defined $\mathcal{H}$ (right) by the TopK method for $k=2$; we select the 2 edges from each token with the largest edge weights.
  • Figure 3: The Feature Density Curve fits a Zipf curve with $R^2=0.982$. The middle part of the feature density distribution (features 100-20,000) fits the Zipf curve with $R^2=0.996$.
  • Figure 4: SAEs trained with the Mutual Choice activation function and those finetuned with the Feature Choice activation function have better downstream loss recovered at equivalent sparsity levels.
  • Figure 5: The Mutual Choice SAEs with the $\mathtt{aux\_zipf\_loss}$ applied and the Feature Choice SAEs have fewer dead features than SAEs trained without the $\mathtt{aux\_zipf\_loss}$.
  • ...and 2 more figures
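
The kind of Zipf fit reported in the Figure 3 caption can be sketched as follows. The fitting procedure is not specified in this excerpt, so the approach here (ordinary least squares on log-density against log-rank, with $R^2$ computed in log space) is an assumption, as are the function name `zipf_fit` and the synthetic densities used to exercise it.

```python
import numpy as np

def zipf_fit(densities):
    """Fit density ~ C / rank**s by least squares in log-log space.

    Returns the estimated exponent s and the R^2 of the log-log fit.
    """
    d = np.sort(np.asarray(densities, dtype=float))[::-1]  # most frequent first
    d = d[d > 0]                       # dead features (zero density) have no log
    rank = np.arange(1, d.size + 1)
    x, y = np.log(rank), np.log(d)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)
    return -slope, r2

# Synthetic feature densities that follow an exact Zipf law with exponent 1.
rng = np.random.default_rng(0)
synthetic = rng.permutation(0.1 / np.arange(1, 1001))

exponent, r2 = zipf_fit(synthetic)
print(f"exponent={exponent:.3f}, R^2={r2:.6f}")
```

Restricting the fit to a rank window, as the caption does for features 100-20,000, would amount to slicing the sorted densities before the regression.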