Residual Stream Analysis with Multi-Layer SAEs

Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison

TL;DR

The multi-layer SAE (MLSAE) is a single sparse autoencoder trained on residual-stream activations from every transformer layer, designed to study how information flows across layers in language models. For a given token, an individual latent is often active at a single layer, but when aggregating over many tokens its activations spread across multiple layers, and this spread increases in larger models. Applying tuned-lens transformations changes these patterns only modestly: the layer at which a latent is active still varies between tokens, which remains a challenge for cross-layer interpretability. The work provides a reproducible framework for mechanistic interpretability in transformers and releases code and models to enable replication and further study.

Abstract

Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, SAEs are usually trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer. Given that the residual stream is understood to preserve information across layers, we expected MLSAE latents to 'switch on' at a token position and remain active at later layers. Interestingly, we find that individual latents are often active at a single layer for a given token or prompt, but the layer at which an individual latent is active may differ for different tokens or prompts. We quantify these phenomena by defining a distribution over layers and considering its variance. We find that the variance of the distributions of latent activations over layers is about two orders of magnitude greater when aggregating over tokens compared with a single token. For larger underlying models, the degree to which latents are active at multiple layers increases, which is consistent with the fact that the residual stream activation vectors at adjacent layers become more similar. Finally, we relax the assumption that the residual stream basis is the same at every layer by applying pre-trained tuned-lens transformations, but our findings remain qualitatively similar. Our results represent a new approach to understanding how representations change as they flow through transformers. We release our code to train and analyze MLSAEs at https://github.com/tim-lawson/mlsae.
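
The comparison between per-token and aggregate variance can be illustrated with a toy calculation. The following is a minimal sketch, not the authors' implementation: it assumes the distribution over layers for one latent is obtained by normalizing its non-negative activations across layers, and the names `layer_distribution`, `layer_variance`, and `toy_acts` are hypothetical. The released code would operate on tensors rather than Python lists.

```python
def layer_distribution(acts_by_layer):
    """Normalize non-negative per-layer activations into a distribution over layers.

    Assumption: probability of a layer is its share of the total activation.
    """
    total = sum(acts_by_layer)
    if total == 0:
        return None  # latent never active: the distribution is undefined
    return [a / total for a in acts_by_layer]


def layer_variance(probs):
    """Variance of the layer index under a distribution over layers."""
    mean = sum(layer * p for layer, p in enumerate(probs))
    return sum(p * (layer - mean) ** 2 for layer, p in enumerate(probs))


# Toy activations of ONE latent: rows are tokens, columns are layers 0..3.
# Each token activates the latent at a single layer, but the layer differs.
toy_acts = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Variance over layers for each token taken separately.
per_token_var = [layer_variance(layer_distribution(row)) for row in toy_acts]

# Variance over layers after summing activations over all tokens.
aggregate = [sum(col) for col in zip(*toy_acts)]
aggregate_var = layer_variance(layer_distribution(aggregate))

print(per_token_var)   # each token is active at exactly one layer
print(aggregate_var)   # aggregating over tokens spreads mass over layers
```

In this toy case every per-token variance is zero while the aggregate variance is positive, which mirrors the qualitative pattern described in the abstract: a latent can be single-layer for each token yet multi-layer in aggregate.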

Paper Structure

This paper contains 29 sections, 12 equations, 41 figures, and 2 tables.

Figures (41)

  • Figure 1: The mean cosine similarities between the residual stream activation vectors at adjacent layers of transformers, over 10 million tokens from the test set. To compare transformers with different numbers of layers, we divide the index of the lower layer in each pair of adjacent layers by the number of pairs. This 'relative layer' is the $x$-axis of the plot. We subtract the dataset mean from the activation vectors at each layer before computing cosine similarities to control for changes in the norm between layers \cite{heimersheim_residual_2023}, which we demonstrate in Figure 4.
  • Figure 2: Heatmaps of the distributions of latent activations over layers when aggregating over 10 million tokens from the test set. Here, we plot the distributions for MLSAEs trained on Pythia models with an expansion factor of $R = 64$ and sparsity $k = 32$. The latents are sorted in ascending order of the expected value of the layer index (Eq. \ref{eqn:layer_probability}).
  • Figure 3: Heatmaps of the distributions of latent activations over layers for a single example prompt. Here, we plot the distributions for MLSAEs trained on Pythia models with an expansion factor of $R = 64$ and sparsity $k = 32$. The example prompt is "When John and Mary went to the store, John gave" \cite{wang_interpretability_2022}. We exclude latents with maximum activation below 1.0e-3 and sort latents in ascending order of the expected value of the layer index (Eq. \ref{eqn:layer_probability}).
  • Figure 4: The mean $L^2$ norm of the residual stream activation vectors at every layer, over 10 million tokens from the test set. To compare transformers with different numbers of layers, we divide the layer index $\ell$ by the number of layers ${n_L}$. This 'relative layer' is the $x$-axis of the plot.
  • Figure 5: The fraction of the total variance explained by individual latents and the fraction of the variance for an individual latent explained by individual tokens (Eqs. \ref{eqn:variance_ratio_latent} and \ref{eqn:variance_ratio_token}) for MLSAEs with an expansion factor of $R=64$ and sparsity $k=32$, over 10 million tokens from the test set. Missing bars for tuned-lens MLSAEs indicate that no results are available, not that the values are zero.
  • ...and 36 more figures
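
The measurement described in the Figure 1 caption (mean-centered cosine similarity between adjacent layers, plotted against a relative layer index) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the `resid[layer][token]` data layout are hypothetical, and the mean is taken over the toy tokens rather than over a full dataset.

```python
import math


def centered(vectors):
    """Subtract the mean vector (taken over tokens) from each activation vector."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [[x - m for x, m in zip(v, mean)] for v in vectors]


def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def adjacent_layer_similarity(resid):
    """Return (relative_layer, mean cosine similarity) for each adjacent layer pair.

    resid[layer][token] is an activation vector (hypothetical layout).
    The relative layer is the lower layer index divided by the number of pairs.
    """
    cent = [centered(layer) for layer in resid]
    n_pairs = len(resid) - 1
    results = []
    for lower in range(n_pairs):
        sims = [cosine(u, v) for u, v in zip(cent[lower], cent[lower + 1])]
        results.append((lower / n_pairs, sum(sims) / len(sims)))
    return results


# Toy residual stream: 3 layers, 2 tokens, 2 dimensions.
toy_resid = [
    [[1.0, 0.0], [3.0, 0.0]],  # layer 0
    [[2.0, 0.0], [6.0, 0.0]],  # layer 1: same direction as layer 0, larger norm
    [[0.0, 1.0], [0.0, 5.0]],  # layer 2: orthogonal direction
]

similarities = adjacent_layer_similarity(toy_resid)
print(similarities)
```

Here layers 0 and 1 differ only in scale, so their centered similarity is 1, while layers 1 and 2 are orthogonal after centering, so their similarity is 0. Because cosine similarity ignores scale, the growth in norm between layers 0 and 1 does not affect the result.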