Interpreting Attention Layer Outputs with Sparse Autoencoders

Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda

TL;DR

This work introduces Attention Output Sparse Autoencoders (SAEs) to decompose attention layer activations in transformers up to 2B parameters, addressing the polysemanticity of attention heads. It develops weight-based head attribution, direct feature attribution, and Recursive Direct Feature Attribution (RDFA) to map sparse, interpretable features to specific heads and upstream components, enabling circuit-level analyses. The study identifies three primary feature families—induction, local context, and high-level context—and demonstrates substantial head polysemanticity in GPT-2 Small, including long-prefix versus short-prefix induction distinctions and insights into the Indirect Object Identification circuit. The authors provide open-source SAEs, dashboards, and a circuit explorer to empower further mechanistic interpretability research.

Abstract

Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that, here too, SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families and up to 2B parameters. We perform a qualitative study of the features computed by attention layers, and find multiple families: long-range context, short-range context, and induction features. We qualitatively study the role of every head in GPT-2 Small, and estimate that at least 90% of the heads are polysemantic, i.e. have multiple unrelated roles. Further, we show that Sparse Autoencoders are a useful tool that enables researchers to explain model behavior in greater detail than prior work. For example, we explore the mystery of why models have so many seemingly redundant induction heads, use SAEs to motivate the hypothesis that some are long-prefix while others are short-prefix, and confirm this with more rigorous analysis. We use our SAEs to analyze the computation performed by the Indirect Object Identification circuit (Wang et al., 2022), validating that the SAEs find causally meaningful intermediate variables and deepening our understanding of the semantics of the circuit. We open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of Attention Output SAEs.
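
A minimal sketch of the setup described above, in the standard ReLU-plus-L1 style of Bricken et al. (2023): an SAE over $\textbf{z}_{\text{cat}}$, the concatenation of every head's output taken before the attention block's output projection. The class name, initialization scale, and L1 coefficient below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnOutputSAE(nn.Module):
    """SAE over z_cat, the concatenated per-head attention outputs
    (d_in = n_heads * d_head), taken before the output projection W_O."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
        # Decoder bias is subtracted at the input and added back at the output.
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, z_cat: torch.Tensor):
        # Sparse, non-negative feature activations.
        f = F.relu((z_cat - self.b_dec) @ self.W_enc + self.b_enc)
        recon = f @ self.W_dec + self.b_dec
        return recon, f

def sae_loss(z_cat, recon, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse codes.
    return F.mse_loss(recon, z_cat) + l1_coeff * f.abs().sum(-1).mean()
```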

Figures (21)

  • Figure 1: Overview. We train Sparse Autoencoders (SAEs) on $\textbf{z}_{\text{cat}}$, the attention layer outputs before the output projection $W_O$, concatenated across all heads. The SAEs extract linear directions that correspond to concepts in the model, giving us insight into what attention layers learn in practice. Further, we uncover what information was used to compute these features with direct feature attribution (DFA; see the Methodology section, and the per-head sketch after this list).
  • Figure 2: Specificity plot (a), in the style of Bricken et al. (2023), comparing the distribution of the board induction feature's activations to the activation of our proxy. The expected value plot (b) shows the distribution of feature activations weighted by activation level (Bricken et al., 2023), compared to the activation of the proxy. Note that red is stacked on top of blue, where blue represents examples that our proxy identified as board induction. We observe high specificity above the weakest feature activations.
  • Figure 3: An indication of polysemanticity for head 10.2: on synthetic datasets for two unrelated tasks, digit copying (a) and URL completion (b), ablating 10.2 causes a large average effect on the loss relative to the other heads in layer 10.
  • Figure 4: Two lines of evidence that 5.1 specializes in long-prefix induction, while 5.5 primarily does short-prefix induction. In (a) we see that 5.1's induction score (Olsson et al., 2022; see the sketch after this list) sharply increases from less than 0.3 to over 0.7 as we transition to long prefix lengths, while 5.5 already starts at 0.7 for short prefixes. In (b) we see that intervening on examples of long-prefix induction from the training distribution causes 5.1 to essentially stop attending to that token, while 5.5 continues to show an induction attention pattern.
  • Figure 5: Results from two noising experiments on induction layers' attention outputs at the S2 position. Noising from a distribution that only changes " and" to " alongside" degrades performance, while three simultaneous perturbations that maintain whether the duplicate name comes after the " and" token preserve 93% of the average logit difference.
  • ...and 16 more figures
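
Because $\textbf{z}_{\text{cat}}$ is a concatenation of the heads' pre-$W_O$ outputs, a feature's encoder pre-activation is linear in each head's slice of that vector, which is what lets DFA (Figure 1) attribute a feature to individual heads. Below is a sketch of per-head DFA against the SAE sketched earlier; the function name and interface are assumptions for illustration:

```python
import torch

def dfa_by_head(z_cat, sae, feature_idx, n_heads, d_head):
    """Split feature `feature_idx`'s pre-activation into per-head parts.

    z_cat: [d_in] concatenated head outputs at one token position,
    with d_in = n_heads * d_head.
    """
    w = sae.W_enc[:, feature_idx]                          # [d_in] encoder direction
    per_head = (z_cat * w).view(n_heads, d_head).sum(-1)   # [n_heads] contributions
    # Bias terms are not attributable to any head; report them separately.
    bias = sae.b_enc[feature_idx] - sae.b_dec @ w
    return per_head, bias
```

Summing per_head with bias recovers the pre-ReLU feature activation, and Recursive DFA repeats the decomposition upstream: each head's term is itself an attention-weighted sum over source positions, which can in turn be attributed to earlier components.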
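
Figure 4 relies on the induction score of Olsson et al. (2022): on a sequence whose second half repeats its first half, a head's score is its average attention from each repeated token back to the token that followed that token's first occurrence. A hedged sketch (the name and calling convention are assumptions):

```python
import torch

def induction_score(attn: torch.Tensor) -> torch.Tensor:
    """attn: [seq_len, seq_len] attention pattern of one head, on a
    sequence whose second half exactly repeats the first half."""
    half = attn.shape[0] // 2
    # For destination i in the repeat, the induction target is the token
    # one past i's first occurrence, i.e. position i - half + 1.
    dst = torch.arange(half, 2 * half)
    return attn[dst, dst - half + 1].mean()
```

Probing long- versus short-prefix induction as in Figure 4a amounts to varying how many tokens immediately before the repeated token also match the earlier occurrence, rather than repeating the entire sequence.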