Tackling Polysemanticity with Neuron Embeddings

Alex Foote

TL;DR

This paper describes how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs), and provides a UI for exploring the results.

Abstract

We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain- and architecture-agnostic and removing the risk of introducing external structure that may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).
