Toy Models of Superposition

Nelson Elhage; Tristan Hume; Catherine Olsson; Nicholas Schiefer; Tom Henighan; Shauna Kravec; Zac Hatfield-Dodds; Robert Lasenby; Dawn Drain; Carol Chen; Roger Grosse; Sam McCandlish; Jared Kaplan; Dario Amodei; Martin Wattenberg; Christopher Olah

TL;DR

This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition".

Abstract

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

Paper Structure (81 sections, 13 equations, 29 figures)

Toy Models of Superposition
AUTHORS
We recommend reading this paper as an HTML article.
KEY RESULTS FROM OUR TOY MODELS
Definitions and Motivation: Features, Directions, and Superposition
Empirical Phenomena
What are Features?
Features as Directions
Privileged vs Non-privileged Bases
The Superposition Hypothesis
Summary: A Hierarchy of Feature Properties
Demonstrating Superposition
Experiment Setup
THE FEATURE VECTOR ($X$)
Linear Model
...and 66 more sections

Figures (29)

Figure 1: In a privileged basis, there is an incentive for features to align with basis dimensions. This doesn't necessarily mean they will. Examples: conv net neurons, transformer MLPs
Figure 2: HYPOTHETICAL DISENTANGLED MODEL
Figure 3: HYPOTHETICAL DISENTANGLED MODEL
Figure 4: A pentagonal bipyramid is the
Figure 5: Digon (Square) Solutions
...and 24 more figures
