Interpreting Neural Networks through the Polytope Lens

Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy

TL;DR

This work critiques neuron- and direction-centric views of neural representations, arguing that polysemanticity and nonlinear activation functions make neurons and directions inadequate as fundamental units of description. It introduces the polytope lens: in networks with piecewise-linear activations, the activation space is partitioned into polytopes, each of which implements an affine transformation, and spline codes track polytopes across layers. The authors present three predictions (monosemantic polytope regions, semantic boundaries that coincide with polytope boundaries, and polytope-defined validity regions for feature directions) and provide experimental evidence from image and language models that partially supports them and highlights the representational flow between polytopes. While promising, the approach faces scalability challenges and may be most powerful when integrated with, rather than replacing, existing linear-style interpretations. The polytope lens thus offers a framework for incorporating nonlinearity into mechanistic interpretability and guides future work on representational flow and activation-scale effects.

Abstract

Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the correct fundamental units of description: directions cannot describe how neural networks use nonlinearities to structure their representations. Moreover, many instances of individual neurons and their combinations are polysemantic (i.e., they have multiple unrelated meanings). Polysemanticity makes interpreting the network in terms of neurons or directions challenging since we can no longer assign a specific feature to a neural unit. In order to find a basic unit of description that does not suffer from these problems, we zoom in beyond just directions to study the way that piecewise linear activation functions (such as ReLU) partition the activation space into numerous discrete polytopes. We call this perspective the polytope lens. The polytope lens makes concrete predictions about the behavior of neural networks, which we evaluate through experiments on both convolutional image classifiers and language models. Specifically, we show that polytopes can be used to identify monosemantic regions of activation space (while directions are not in general monosemantic) and that the density of polytope boundaries reflects semantic boundaries. We also outline a vision for what mechanistic interpretability might look like through the polytope lens.
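To make the idea of polytopes concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It builds a hypothetical toy ReLU network with random weights (the layer sizes, the seed, and the names `forward` and `local_affine_map` are all illustrative), records the on/off pattern of every ReLU for an input (its spline code), and then checks that the network agrees with a single affine map on the polytope that this code identifies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 2 inputs -> 4 ReLU units -> 3 ReLU units -> 2 outputs.
sizes = [2, 4, 3, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]


def forward(x):
    """Return the network output and the spline code (on/off state of every ReLU)."""
    code = []
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ h + b
        mask = pre > 0              # which neurons are active for this input
        code.append(mask)
        h = np.where(mask, pre, 0.0)  # ReLU
    out = weights[-1] @ h + biases[-1]  # final layer is linear
    return out, np.concatenate(code)


def local_affine_map(code):
    """Freeze every ReLU to the state in `code` and compose the layers.

    Returns (A, c) such that the network equals A @ x + c for every input
    whose spline code is `code`, i.e. everywhere inside that polytope.
    """
    A = np.eye(sizes[0])
    c = np.zeros(sizes[0])
    start = 0
    for W, b in zip(weights[:-1], biases[:-1]):
        mask = code[start:start + len(b)].astype(float)
        start += len(b)
        A = mask[:, None] * (W @ A)  # zero out rows of inactive neurons
        c = mask * (W @ c + b)
    A = weights[-1] @ A
    c = weights[-1] @ c + biases[-1]
    return A, c


x = np.array([0.3, -0.7])
out, code = forward(x)
A, c = local_affine_map(code)
print("spline code:", code.astype(int))
assert np.allclose(out, A @ x + c)  # the affine map reproduces the network at x
```

Two inputs with the same spline code share the same `(A, c)`, which is the sense in which each polytope implements one affine transformation. Inputs with different codes generally get different maps.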

Paper Structure (17 sections, 3 equations, 34 figures)

Figures (34)

  • Figure 1: An example of a polysemantic neuron in InceptionV1 (layer inception5a, neuron 233) which seems to respond to a mix of dog noses and metal poles (and maybe boats).
  • Figure 2: An example of a polysemantic neuron in GPT2-Medium. The text highlights represent the activation magnitude: the redder the text, the larger the activation. This neuron seems to react strongly to commas in lists, but also to diminutive adjectives (‘small’, ‘lame’, ‘tired’) and some prepositions (‘of’, ‘in’, ‘by’), among other features.
  • Figure 3: Scaling the activations in a layer causes semantic changes later in the network despite no change in activation direction in the scaled layer. The image on the right represents the input image.
  • Figure 4: Affine transformations in the activated / unactivated regions of one neuron (assuming the three other neurons are activated).
  • Figure 5: Polytope boundaries are defined by the weights and bias of a neuron. The weights determine the orientation of the (hyper-) plane and the bias determines its height.
  • ...and 29 more figures
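
The captions of Figures 3 and 5 can be illustrated together with a small numerical sketch. This is a hypothetical two-neuron example with hand-picked weights, not the InceptionV1 experiment behind Figure 3. Each neuron's boundary is the hyperplane where `W @ h + b` equals zero, so when biases are nonzero, rescaling an activation vector `h` without changing its direction can move it across a boundary and into a different polytope of the next layer.

```python
import numpy as np

# Hypothetical next-layer weights and biases, chosen by hand for illustration.
W = np.array([[1.0, 0.0],
              [-1.0, 0.0]])
b = np.array([-2.0, 1.5])
h = np.array([1.0, 0.0])  # activation vector whose direction stays fixed


def relu_pattern(h, W, b):
    """On/off state of each ReLU neuron for input h (1 = active, 0 = inactive)."""
    return tuple((W @ h + b > 0).astype(int).tolist())


scales = (1.0, 3.0)
# With biases, the boundaries do not pass through the origin, so scaling h
# can change which neurons are active.
with_bias = {a: relu_pattern(a * h, W, b) for a in scales}
# Without biases, every boundary passes through the origin, so positive
# scaling never changes the pattern.
no_bias = {a: relu_pattern(a * h, W, np.zeros(2)) for a in scales}

print("with bias:   ", with_bias)
print("without bias:", no_bias)
```

In this toy setting the pattern differs between the two scales only when the biases are nonzero, which is consistent with the observation that activation scale, and not just direction, can matter for later layers.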