Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky, Ian Maksimov, Daniil Gavrilov

TL;DR

The paper tackles how interpretable features extracted by Sparse Autoencoders (SAEs) evolve across neural network layers under polysemanticity and feature superposition. It introduces SAE Match, a data-free method that aligns features across layers by folding activation thresholds into weights and minimizing the mean squared error (MSE) between folded SAE parameters, enabling cross-layer feature tracking without input data. Key contributions include the folding operation to account for scale differences, the use of permutation-based matching (and its composition) across layers, and empirical validation on the Gemma 2 model showing feature persistence and approximate state reconstruction, as well as potential for layer pruning. This work provides a practical tool for mechanistic interpretability, offering insights into feature dynamics and layer-wise transformations in large language models.

Abstract

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
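
As a rough illustration of the matching step described in the abstract, the following minimal sketch folds thresholds into toy decoder weights and then searches for the feature permutation that minimizes the squared error between the two folded decoders. Several things here are assumptions rather than details taken from the paper: the folding rule (scaling each feature's decoder row by its threshold), the use of SciPy's `linear_sum_assignment` as the solver, the restriction to decoder weights only, and all names (`fold_decoder`, `match_features`, the toy shapes). The data are synthetic, not Gemma 2 SAE weights.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fold_decoder(w_dec, theta):
    """Assumed folding rule: scale feature i's decoder row by its threshold theta[i]."""
    return w_dec * theta[:, None]


def match_features(w_a, w_b):
    """Find the assignment of B's features to A's features with minimal total squared error."""
    # cost[i, j] = squared Euclidean distance between A's feature i and B's feature j.
    cost = ((w_a[:, None, :] - w_b[None, :, :]) ** 2).sum(axis=-1)
    _, cols = linear_sum_assignment(cost)
    return cols  # cols[i] is the index of the B feature assigned to A feature i


rng = np.random.default_rng(0)
n_features, d_model = 6, 4  # toy sizes, chosen only for illustration

# Toy "layer A" SAE: decoder rows and positive thresholds.
w_dec_a = rng.normal(size=(n_features, d_model))
theta_a = rng.uniform(0.5, 2.0, size=n_features)
folded_a = fold_decoder(w_dec_a, theta_a)

# Toy "layer B": the same features in shuffled order with different thresholds,
# so raw decoder rows differ in scale while folded rows nearly coincide.
shuffle = rng.permutation(n_features)
theta_b = rng.uniform(0.5, 2.0, size=n_features)
folded_target = folded_a[shuffle] + 1e-3 * rng.normal(size=(n_features, d_model))
w_dec_b = folded_target / theta_b[:, None]
folded_b = fold_decoder(w_dec_b, theta_b)

perm = match_features(folded_a, folded_b)
mse_matched = np.mean((folded_a - folded_b[perm]) ** 2)
mse_unmatched = np.mean((folded_a - folded_b) ** 2)
print("permutation:", perm)
print("MSE matched / unmatched:", mse_matched, mse_unmatched)
```

Because the procedure only touches SAE parameters, no model inputs are needed, which is what makes this kind of matching data-free. In this toy setup the recovered permutation is the inverse of `shuffle`, since layer B was constructed as a shuffled copy of layer A.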

Paper Structure

This paper contains 17 sections, 6 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Left: Hidden state norms and the mean ${\bm{\theta}}$ value in trained JumpReLU activations within SAE modules. Right: Dynamics of hidden state norm changes and the differences in norms of matched decoder columns ${\bm{W}}_{\mathrm{dec}}$ after the folding operation. These results suggest that ${\bm{\theta}}$ captures the growth of hidden state norms. After folding ${\bm{\theta}}$ into the weights, the decoder weights ${\bm{W}}_{\mathrm{dec}}^\prime$ become dependent on the dynamics of hidden state norms, leading to a lower overall MSE during matching. For more details on the folding operation and method description, see Section \ref{section:method}.
  • Figure 2: A schematic illustration of the differences in SAE matching with and without folded parameters. When no folding is performed (top), ${\bm{\theta}}$ encapsulates differences in hidden state norms, causing features ${\bm{f}}^{(A)}$ and ${\bm{f}}^{(B)}$ to have different scales, while the columns of decoder weights ${\bm{W}}_{\mathrm{dec}_{i, :}}^{(A)}$ and ${\bm{W}}_{\mathrm{dec}_{i, :}}^{(B)}$ have similar norms. Matching similar columns leads to differences in the actual reconstructions of the input $\hat{{\bm{x}}}$, which we hypothesize is detrimental for matching SAE features. With ${\bm{\theta}}$ folding (bottom), we transfer the differences in input (and thus feature) norms to the decoder weights ${\bm{W}}_{\mathrm{dec}_{i, :}}^{{(A)}^\prime}$ and ${\bm{W}}_{\mathrm{dec}_{i, :}}^{{(B)}^\prime}$, thereby matching features while accounting for differences in input norms. As a result, reconstructions of matched features are closer to each other than in the unfolded variant of the algorithm. See Section \ref{section:folding} for more details.
  • Figure 3: Results of the MSE objective for different layer matching methods. "Vanilla matching" refers to matching without any permutations. The "Matched" and "Folded+Matched" variants correspond to unfolded and folded matching, respectively. In all cases, MSE is evaluated with folded parameters (i.e., for unfolded matching, parameters are first matched, then folded, and finally MSE is evaluated). When differences in input scale are taken into account (see Section \ref{section:folding}), this can be interpreted as the MSE in the scale of actual input reconstructions in the relevant layers. The unfolded matching consistently showed higher MSE in this scale, supporting Hypothesis \ref{hyp:second}. Note that $b_{\mathrm{dec}}$ is omitted as it does not affect the order of features in the SAE layer. For further details, refer to Section \ref{exp_folding}.
  • Figure 4: External LLM evaluation split by layers. As before, folding the thresholds results in optimistic labeling; it also affects deeper layers, making them more optimistic as well.
  • Figure 5: Features matched with folded parameters from the 19th to the 20th layer using the proposed method are sorted by their MSE values across the relevant SAE decoder weights. Features with small MSE values (on the left) indicate semantic similarity, while those with large MSE values (on the right) indicate that no similar features were found. For further details, refer to Section \ref{exp_feature_matching}.
  • ...and 14 more figures
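
The folding idea referenced in the captions for Figures 1 to 3 can be sketched on a toy JumpReLU SAE. The sketch below assumes one plausible form of the folding rule (divide encoder columns and biases by ${\bm{\theta}}$, multiply decoder rows by ${\bm{\theta}}$, and set the new threshold to 1); this specific form, the function names, and the random toy weights are assumptions made for illustration and are not taken from the paper. Under this rule, feature activations are rescaled by $1/{\bm{\theta}}$ while the reconstruction $\hat{{\bm{x}}}$ stays unchanged, so the scale information moves from the thresholds into the decoder weights.

```python
import numpy as np


def jump_relu(z, theta):
    """Pass a pre-activation through unchanged only where it exceeds its threshold."""
    return z * (z > theta)


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, theta):
    """Return feature activations and the reconstruction of x."""
    f = jump_relu(x @ w_enc + b_enc, theta)
    return f, f @ w_dec + b_dec


def fold(w_enc, b_enc, w_dec, theta):
    """Assumed folding rule: move theta into the weights so the new threshold is 1."""
    return w_enc / theta, b_enc / theta, w_dec * theta[:, None], np.ones_like(theta)


rng = np.random.default_rng(0)
d_model, n_features, batch = 4, 8, 5  # toy sizes, chosen only for illustration

w_enc = rng.normal(size=(d_model, n_features))
b_enc = rng.normal(size=n_features)
w_dec = rng.normal(size=(n_features, d_model))
b_dec = rng.normal(size=d_model)
theta = rng.uniform(0.5, 2.0, size=n_features)  # thresholds must be positive
x = rng.normal(size=(batch, d_model))

f, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec, theta)
w_enc_f, b_enc_f, w_dec_f, theta_f = fold(w_enc, b_enc, w_dec, theta)
f_folded, x_hat_folded = sae_forward(x, w_enc_f, b_enc_f, w_dec_f, b_dec, theta_f)

print("max reconstruction difference:", np.abs(x_hat - x_hat_folded).max())
```

The decoder bias is passed through unchanged in this sketch, consistent with the Figure 3 note that $b_{\mathrm{dec}}$ does not affect the order of features.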