Mechanistic Permutability: Match Features Across Layers
Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
TL;DR
The paper studies how interpretable features extracted by Sparse Autoencoders (SAEs) evolve across the layers of a neural network, a question complicated by polysemanticity and feature superposition. It introduces SAE Match, a data-free method that aligns features across layers by folding activation thresholds into the SAE weights and then minimizing the mean squared error (MSE) between the folded parameters, so features can be tracked across layers without any input data. Key contributions include the folding operation, which accounts for differences in feature scale; permutation-based matching between layers, with permutations composed to follow a feature over several layers; and empirical validation on the Gemma 2 model showing that features persist across layers and that hidden states can be approximately reconstructed, with a potential application to layer pruning. The result is a practical tool for mechanistic interpretability that offers insight into feature dynamics and layer-wise transformations in large language models.
Abstract
Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
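The two steps named in the abstract, folding thresholds into the weights and matching features by minimizing mean squared error, can be illustrated with a minimal sketch. It rests on assumptions not stated in this section: each SAE is taken to be a JumpReLU-style autoencoder with one positive threshold per feature, only the decoder weights are compared, and the permutation is found with SciPy's linear assignment solver. The function names `fold_thresholds` and `match_features`, and all array shapes, are hypothetical and chosen for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fold_thresholds(w_enc, b_enc, w_dec, theta):
    """Fold per-feature thresholds into the weights so each threshold becomes 1.

    Assumed shapes: w_enc (d_model, n_feat), b_enc (n_feat,),
    w_dec (n_feat, d_model), theta (n_feat,) with theta > 0.
    """
    w_enc_f = w_enc / theta            # scale encoder column i by 1 / theta_i
    b_enc_f = b_enc / theta            # scale encoder bias i by 1 / theta_i
    w_dec_f = w_dec * theta[:, None]   # scale decoder row i by theta_i
    return w_enc_f, b_enc_f, w_dec_f


def match_features(w_dec_a, w_dec_b):
    """Return perm where row perm[i] of w_dec_b is matched to row i of w_dec_a.

    The assignment minimizes the total squared error between matched rows.
    """
    # cost[i, j] = ||a_i - b_j||^2, expanded to avoid an explicit 3-D array
    sq_a = (w_dec_a ** 2).sum(axis=1)[:, None]
    sq_b = (w_dec_b ** 2).sum(axis=1)[None, :]
    cost = sq_a + sq_b - 2.0 * (w_dec_a @ w_dec_b.T)
    _, col = linear_sum_assignment(cost)
    return col
```

Under these assumptions, folding leaves the reconstruction unchanged: the folded pre-activations are the originals divided by the threshold, so the same features fire against a threshold of 1, and the rescaled decoder rows undo the division. Matching would then be applied to the folded decoder weights of two adjacent layers, and permutations from consecutive layer pairs could be composed by indexing one into the next.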
