Tackling Polysemanticity with Neuron Embeddings

Alex Foote

TL;DR

Describes how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs), and provides a UI for exploring the results.

Abstract

We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An illustration of the neuron embedding process. We compute the element-wise products of the pre-MLP embedding of the inputs and the neuron input weights to produce the neuron embeddings. These are then clustered based on similarity. The neuron weights select the relevant information from the embedding, such that the neuron embeddings of two different inputs can be brought together or pushed apart.
  • Figure 2: An example of feature clustering applied to a neuron in layer 7 of GPT2-small. The clusters (colour and numerically coded) each show a distinct semantic behaviour, and the dendrogram shows how the cluster hierarchy formed. The highlighting corresponds to neuron activation on each token, with the neuron embedding derived from the maximally activating token.
  • Figure 3: An example of a neuron with a common primary behaviour (orange) and a rare secondary behaviour (blue).
  • Figure 4: A comparison between feature clusters derived from neuron embeddings vs pre-MLP embeddings. The neuron embeddings clearly result in denser clusters with better separation between the clusters. Examples from the two clusters are shown in their corresponding colours.
  • Figure 5: Visualisations for an SAE neuron. The activation map shows the maximum magnitude of neuron activation for each pixel in the input, and the importance map is the average dataset example scaled by the activation map.
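
The process described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy vectors, the cosine distance metric, the average-linkage merge rule, the distance threshold, and the function names (`neuron_embedding`, `cosine_distance`, `cluster`) are all hypothetical choices made for the example. Only the element-wise product and the idea of clustering by similarity come from the caption.

```python
import math


def neuron_embedding(pre_mlp, w_in):
    """Element-wise product of a pre-MLP embedding and one neuron's input weights."""
    return [x * w for x, w in zip(pre_mlp, w_in)]


def cosine_distance(a, b):
    """1 - cosine similarity (assumed metric); returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def cluster(vectors, threshold):
    """Naive average-linkage agglomerative clustering; returns lists of input indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it, unless it is too far apart.
        dist, ia, ib = min(
            (linkage(clusters[ia], clusters[ib]), ia, ib)
            for ia in range(len(clusters))
            for ib in range(ia + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[ia] += clusters[ib]
        del clusters[ib]
    return clusters


# Toy data (hypothetical): four inputs with 4-dimensional pre-MLP embeddings.
# The neuron's input weights keep the first two dimensions and zero out the rest.
w_in = [1.0, 1.0, 0.0, 0.0]
pre_mlp = [
    [1.0, 0.1, 5.0, -3.0],
    [0.9, 0.0, -4.0, 6.0],
    [0.1, 1.0, 5.0, -3.0],
    [0.0, 0.8, -4.0, 6.0],
]

neuron_embs = [neuron_embedding(x, w_in) for x in pre_mlp]
raw_clusters = cluster(pre_mlp, threshold=0.5)
neuron_clusters = cluster(neuron_embs, threshold=0.5)

print("clusters from pre-MLP embeddings:", raw_clusters)
print("clusters from neuron embeddings: ", neuron_clusters)
```

In this toy setup, inputs 0 and 2 group together when clustering the raw pre-MLP embeddings, because the dimensions the neuron ignores dominate their similarity. After weighting by the neuron's input weights, inputs 0 and 1 group together instead. This mirrors the caption's point that the weights can bring the neuron embeddings of two inputs together or push them apart.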