Evolution of SAE Features Across Layers in LLMs

Daniel Balcells, Benjamin Lerner, Michael Oesterle, Ediz Ucar, Stefan Heimersheim

TL;DR

This work analyzes statistical relationships between SAE features in adjacent layers to understand how features evolve through a forward pass. It finds that a considerable number of features are passed through from the previous layer, that some features can be expressed as quasi-boolean combinations of previous-layer features, and that some features become more specialized in later layers.

Abstract

Sparse Autoencoders (SAEs) for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors (https://stefanhex.com/spar-2024/feature-browser/), and build communities of related features across layers. We find that a considerable number of features are passed through from the previous layer, that some features can be expressed as quasi-boolean combinations of previous-layer features, and that some features become more specialized in later layers.
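
As a rough illustration of what a "statistical relationship between features in adjacent layers" can look like, the sketch below computes the Pearson correlation between one feature's per-token activations and each candidate feature in the next layer, then flags a pass-through candidate. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the toy activations are hypothetical, and the 0.95 cutoff is borrowed from the caption of Figure 3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length activation sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant (e.g. never-active) feature has no defined correlation;
        # treating it as uncorrelated is an assumption of this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_next_layer_match(prev_acts, next_layer_acts):
    """Return (index, correlation) of the most correlated next-layer feature."""
    scores = [pearson(prev_acts, acts) for acts in next_layer_acts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical toy data: activations of one layer-l feature over six tokens,
# and of two layer-(l+1) features over the same tokens.
prev_feature = [0.0, 1.0, 0.0, 2.0, 0.0, 3.0]
next_layer = [
    [0.0, 2.0, 0.0, 4.0, 0.0, 6.0],  # fires on the same tokens, rescaled
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # fires on the complementary tokens
]

idx, r = best_next_layer_match(prev_feature, next_layer)
is_pass_through = r >= 0.95  # threshold taken from the Figure 3 caption
```

Pearson correlation is invariant to rescaling, so a feature that fires on the same tokens with a different magnitude in the next layer still counts as a match here.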

Paper Structure

This paper contains 27 sections, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Different motifs we found in our feature graph, active on a single forward pass. Nodes represent SAE features in different layers; earlier layers are at the bottom. (a) "Pass-through" features have high correlation and similar semantic meaning between layers. (b) "New" features don't have a counterpart in the preceding layer; we find some that appear to be AND/OR gates of preceding features. (c) "Disappearing" features don't have a similar feature in the following layer. (d) We find many clusters of related features by running modularity detection algorithms on the full correlation graph, shown here as colors.
  • Figure 2: The community structure within an SAE feature graph. The nodes are all the features that were active in the residual stream for a specific text prompt. Each row of the graph corresponds to a layer of the transformer, with the first layer in the bottom row and the last layer in the top row. An edge connects two features whose Jaccard similarity is $>0.1$. Nodes are coloured by the community assigned to them by the Leiden algorithm using the modularity quality function; nodes within a community are semantically similar to one another. For example, the pink community on the far left consists of features related to "Instructions directing on to take action". This graph can be viewed in the feature browser by selecting the jaccard_leiden_modularity_threshold_0.1_masked_single_23 option in the settings.
  • Figure 3: Passed-through/appearing/disappearing SAE features where Pearson $\geq 0.95$
  • Figure 4: Left: Visualizing the recovery of the previous layer's features from the next layer's error term. Right: Heatmaps of SAE feature activation vs. next-layer error projected onto the feature directions of the previous layer, across many tokens. Only features which "disappeared" (necessity $< 0.4$ with all next-layer features) and whose activation is $>0.1\%$ of their max activation are shown.
  • Figure 5: Communities found by applying the Leiden algorithm with a modularity quality function. Intra-layer feature cosine similarity in each community (not used for forming the communities) was measured to be $\ge0.75$. https://www.neuronpedia.org/list/cm158ljma000dbz0hs79d5obl: Necessity based. https://www.neuronpedia.org/gpt2-small/2-res-jb/24349 detects "evidence" in the context of court cases, causes of economic phenomena, and scientific data. Downstream, https://www.neuronpedia.org/gpt2-small/4-res-jb/8314 specializes in court evidence. https://www.neuronpedia.org/list/cm158pj0m0001v04u1xl5o82y: Necessity based. https://www.neuronpedia.org/gpt2-small/5-res-jb/16673 detects "special" in several contexts, https://www.neuronpedia.org/gpt2-small/7-res-jb/5871 focuses on "Special" in the title of a law enforcement official. https://www.neuronpedia.org/list/cm158t2jc0001127a5v8k7bco: Jaccard based. https://www.neuronpedia.org/gpt2-small/6-res-jb/22156, and each feature downstream detects the concept of "an important moment".
  • ...and 14 more figures
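
The edge rule described in the Figure 2 caption (connect two features when their Jaccard similarity exceeds 0.1) can be sketched as follows. This is a minimal sketch, assuming the set-based Jaccard definition over the tokens on which each feature is active; the exact definition used in the paper is not given in this excerpt. The feature names, token indices, and helper functions are hypothetical, and the community-detection step (Leiden with a modularity quality function) is not shown.

```python
def jaccard(active_a, active_b):
    """Set-based Jaccard similarity: |A & B| / |A | B|."""
    a, b = set(active_a), set(active_b)
    union = a | b
    if not union:
        return 0.0  # neither feature ever fires: no evidence of similarity
    return len(a & b) / len(union)


def build_edges(prev_layer, next_layer, threshold=0.1):
    """Connect adjacent-layer features whose similarity exceeds the threshold."""
    edges = []
    for name_a, tokens_a in prev_layer.items():
        for name_b, tokens_b in next_layer.items():
            score = jaccard(tokens_a, tokens_b)
            if score > threshold:
                edges.append((name_a, name_b, score))
    return edges


# Hypothetical toy data: token positions on which each feature is active.
prev_layer = {
    "layer0/feat1": {0, 1, 2, 3},
    "layer0/feat2": {7, 8},
}
next_layer = {
    "layer1/feat5": {1, 2, 3, 4},           # overlaps feat1 on three tokens
    "layer1/feat9": set(range(8, 18)),      # overlaps feat2 on one token only
}

edges = build_edges(prev_layer, next_layer)
```

In this toy example only the pair sharing three of five tokens (similarity 0.6) gets an edge; the pair sharing one of eleven tokens falls just below the 0.1 threshold. A community-detection algorithm would then be run on the resulting weighted graph.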