Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah

TL;DR

This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition."

Abstract

Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as "polysemanticity" that makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

Paper Structure (81 sections, 13 equations, 29 figures)

Figures (29)

  • Figure 1: In a privileged basis, there is an incentive for features to align with basis dimensions. This doesn't necessarily mean they will. Examples: conv net neurons, transformer MLPs
  • Figure 2: HYPOTHETICAL DISENTANGLED MODEL
  • Figure 3: HYPOTHETICAL DISENTANGLED MODEL
  • Figure 4: A pentagonal bipyramid is the
  • Figure 5: Digon (Square) Solutions
  • ...and 24 more figures