Sparse Autoencoders Trained on the Same Data Learn Different Features

Gonçalo Paulo, Nora Belrose

TL;DR

This work tests whether sparse autoencoders (SAEs) trained on identical data but with different random seeds converge on the same features. Latents from different SAEs are matched with the Hungarian algorithm and compared by the cosine similarity of their encoder and decoder weights. The comparison reveals substantial seed-dependent divergence: even for large models, only a minority of features are shared across seeds. Many seed-specific latents remain interpretable, and the seed dependence persists across models, datasets, and hyperparameters, which challenges the notion of a universal feature set. The findings support viewing SAE features as a pragmatic, hierarchical decomposition of activation space rather than an exhaustive catalog of the features a model uses.

Abstract

Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features. For example, in an SAE with 131K latents trained on a feedforward network in Llama 3 8B, only 30% of the features were shared across different seeds. We observed this phenomenon across multiple layers of three different LLMs, two datasets, and several SAE architectures. While ReLU SAEs trained with the L1 sparsity loss showed greater stability across seeds, SAEs using the state-of-the-art TopK activation function were more seed-dependent, even when controlling for the level of sparsity. Our results suggest that the set of features uncovered by an SAE should be viewed as a pragmatically useful decomposition of activation space, rather than an exhaustive and universal list of features "truly used" by the model.

Paper Structure

This paper contains 9 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Cosine similarities of features from SAE 1 with their counterparts in SAE 2. Both SAEs have 32K latents, and are trained on the sixth MLP of Pythia 160M. Contour lines are regions of equal density according to kernel density estimation. We color each SAE 1 latent based on whether the Hungarian algorithm matches it to the same counterpart in SAE 2, or to a different one, when using the decoder and encoder directions.
  • Figure 2: Dependence of the number of latents found only in the base SAE on the number of seeds. We consider a latent X in SAE A to be "shared" in SAE B if and only if X is matched to a latent Y in B with which it has cosine similarity greater than 0.7 according to both the encoder and decoder weights. To generate this plot we select a "base" SAE and compute its overlap with all the other seeds, then we average over all different base seeds.
  • Figure 3: Latent similarity vs. firing frequency. We plot the cosine similarity between matched latents, vs. how often the latent fires in the base SAE. The similarity of each latent is averaged over all the matched latents of different seeds. The histograms in this figure are stacked, and the histogram of number of occurrences has a log-scale from 0 to 500, to highlight the few latents that rarely fire or that fire frequently, and a linear-scale from 500 to 4000. Latent occurrences were collected over 10M tokens of the Pile, the same dataset that the SAEs were trained on.
  • Figure 4: Dependence of overlap of a Pythia-160M SAE on size, number of active latents and training time. On the left we see that the fraction of aligned latents decreases with the increase of the number of latents. Middle shows that increasing the number of active latents, by increasing the value of $k$ for the TopK activation function, also decreases the overlap. On the right, training time increases the alignment of different SAE seeds. Unless otherwise indicated, each SAE has $2^{15}$ latents and was trained on the output of the sixth layer MLP of Pythia 160M, on the first 8B tokens of its training corpus, the Pile.
  • Figure 5: Dependence of overlap on SAE hyperparameters. On the left we see how the fraction of shared latents for a Pythia-160M SAE depends on the layer and on the number of latents. In the middle we compare SAEs with the same expansion factor, 36, trained on different models and positions. On the right we compare SAEs trained on GPT2 using different activation functions and architectures.
  • ...and 4 more figures
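
The "shared latent" criterion in the Figure 2 caption (a one-to-one match whose cosine similarity exceeds 0.7 for both the encoder and the decoder weights) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the weight layout (one row per latent), and the choice to run the Hungarian algorithm on the mean of the encoder and decoder similarities are all assumptions made for illustration; the captions only say that matching uses the encoder and decoder directions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def shared_fraction(enc_a, dec_a, enc_b, dec_b, threshold=0.7):
    """Fraction of SAE A latents that count as shared with SAE B.

    Every argument has shape (n_latents, d_model), one row per latent.
    Assumption: the one-to-one matching maximizes the mean of the
    encoder and decoder cosine similarities.
    """
    enc_sim = cosine_matrix(enc_a, enc_b)
    dec_sim = cosine_matrix(dec_a, dec_b)
    # Hungarian algorithm: one-to-one assignment with maximal total similarity.
    rows, cols = linear_sum_assignment((enc_sim + dec_sim) / 2, maximize=True)
    # A matched pair is shared only if BOTH similarities clear the threshold.
    shared = (enc_sim[rows, cols] > threshold) & (dec_sim[rows, cols] > threshold)
    return float(shared.mean())


if __name__ == "__main__":
    # Toy stand-ins for two SAE seeds: B is a shuffled copy of A in which
    # half of the latents have been replaced by fresh random directions.
    rng = np.random.default_rng(0)
    enc_a, dec_a = rng.normal(size=(8, 256)), rng.normal(size=(8, 256))
    perm = rng.permutation(8)
    enc_b, dec_b = enc_a[perm].copy(), dec_a[perm].copy()
    enc_b[:4] = rng.normal(size=(4, 256))
    dec_b[:4] = rng.normal(size=(4, 256))
    print(shared_fraction(enc_a, dec_a, enc_b, dec_b))
```

Requiring both similarities to pass the threshold makes the criterion conservative: a pair that agrees in decoder direction but reads from a different encoder direction is not counted as shared. Averaging `shared_fraction` over several choices of base SAE would mirror the averaging over base seeds that the caption describes.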