Understanding polysemanticity in neural networks through coding theory

Simon C. Marshall, Jan H. Kirchner

TL;DR

The paper addresses polysemanticity in neural networks through a coding-theory lens that treats activations as a coded representation. It uses the eigenspectrum of the activation covariance matrix and random projections to quantify redundancy, approximate channel capacity, and assess whether the code is smooth or non-differentiable; a power-law exponent $\alpha$ characterizes the decay of the eigenspectrum. Empirical results on a ResNet-based autoencoder and on GPT-2 XL show that $\alpha$ varies with depth and that dropout promotes redundancy and error-correcting structure, while random projections provide explanations complementary to single-neuron probes. This top-down framework offers new avenues for circuit-level interpretability and robustness in large-scale models.

Abstract

Despite substantial efforts, neural network interpretability remains an elusive goal, with previous research failing to provide succinct explanations of most single neurons' impact on the network output. This limitation is due to the polysemantic nature of most neurons, whereby a given neuron is involved in multiple unrelated network states, complicating the interpretation of that neuron. In this paper, we apply tools developed in neuroscience and information theory to propose both a novel practical approach to network interpretability and theoretical insights into polysemanticity and the density of codes. We infer levels of redundancy in the network's code by inspecting the eigenspectrum of the activations' covariance matrix. Furthermore, we show how random projections can reveal whether a network exhibits a smooth or non-differentiable code and hence how interpretable the code is. This same framework explains the advantages of polysemantic neurons for learning performance and explains trends found in recent results by Elhage et al. (2022). Our approach advances the pursuit of interpretability in neural networks, providing insights into their underlying structure and suggesting new avenues for circuit-level interpretability.
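
The eigenspectrum analysis mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `covariance_eigenspectrum` and `fit_power_law_exponent` are hypothetical, the input is assumed to be a matrix of hidden activations with one row per input sample, and an ordinary least-squares fit in log-log space is swapped in for the Huber regression that the Figure 3 caption mentions. The exponent is returned as the slope of the log-log spectrum, so a spectrum decaying like $1/\text{rank}$ gives a value near $-1$.

```python
import numpy as np


def covariance_eigenspectrum(activations):
    """Eigenvalues of the activation covariance matrix, largest first.

    activations: array of shape (n_samples, n_neurons); this layout is an
    assumption of the sketch, not something fixed by the paper.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (activations.shape[0] - 1)
    # eigvalsh returns ascending eigenvalues of a symmetric matrix.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Clip tiny negative values caused by floating-point error.
    return np.clip(eigvals, 0.0, None)


def fit_power_law_exponent(eigvals, n_fit=None):
    """Slope of log(eigenvalue) against log(rank).

    Uses ordinary least squares (an assumption made for simplicity);
    a robust fit such as Huber regression could replace np.polyfit.
    """
    eigvals = eigvals[eigvals > 0]
    if n_fit is not None:
        eigvals = eigvals[:n_fit]
    ranks = np.arange(1, eigvals.size + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(eigvals), deg=1)
    return slope


if __name__ == "__main__":
    # Synthetic activations whose true variances decay as 1 / rank.
    rng = np.random.default_rng(0)
    n_samples, n_neurons = 20000, 50
    true_var = 1.0 / np.arange(1, n_neurons + 1)
    latent = rng.standard_normal((n_samples, n_neurons)) * np.sqrt(true_var)
    # A random rotation mixes the components across "neurons".
    rotation, _ = np.linalg.qr(rng.standard_normal((n_neurons, n_neurons)))
    spectrum = covariance_eigenspectrum(latent @ rotation)
    print("estimated exponent:", fit_power_law_exponent(spectrum))
```

On real hidden activations the spectrum may deviate from a power law, in which case the fitted exponent is only a rough summary of the decay.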

Paper Structure (4 sections, 4 equations, 7 figures)

Figures (7)

  • Figure 1: Network capacity of deep artificial neural networks. a. Schematic of a deep artificial neural network (ANN) receiving a set of input signals (left; orange) and processing the signal through consecutive layers of neurons (right; shades of gray) connected in an all-to-all fashion. b. Schematic of different possible coding schemes within a layer of the ANN. Monosemantic coding (left): one-to-one identification between inputs and neuron activations (highlighted in orange) without overlap. Polysemantic coding (middle): one-to-one identification between inputs and neuron activations with overlap. Superposition coding (right): one-to-many identification between inputs and neuron activations. Note that the third scenario is a sub-type of the second scenario. c. Number of dimensions needed in a neural network's hidden layer per input feature as a function of varying feature sparsity (gray) compared to our theoretical prediction of optimal encoding (green). Data extracted from Elhage et al. (2022). Dashed horizontal lines indicate "sticky" regions (Elhage et al., 2022) where the network code deviates from optimal.
  • Figure 2: Redundancy in channel code enables noise robustness. a. Illustration of the effect of dropout on the three types of code introduced in Figure 1. While dropout corrupts information in the monosemantic (top, circle) and superposition (bottom, star) codes, a polysemantic code can recover from dropout. b., c. Network capacity as a function of network for different levels of dropout $\alpha$ for monosemantic (b, circle), superposition (b, star) and polysemantic code (c, square).
  • Figure 3: The code shapes hidden activation eigenspectrum. a. Left: Schematic illustrating input to (natural images from the CIFAR dataset) and hidden activations of an Autoencoder-ResNet (see Methods for details). Right: Idealized eigenspectrum for hidden activations implementing a maximally redundant (top) or maximally non-redundant (bottom) code. b. Log-log plot of eigenvalues as a function of principal components for hidden activations from different network depths (shades of grey and legend in bottom right). Dashed line indicates the power law fit obtained from Huber regression (see Methods). Large deviations from the power law distribution exist for later layers, hence power law fits represent approximations. Note that the later eigenspectra resemble the predicted spectrum of a linear code (compare a), where the code words are simple linear combinations. c. Estimated power law exponent $\alpha$ as a function of network depth. Dashed line marks $\alpha=-1$. d. Sample of different eigenvectors (rows) displayed as filters of the ResNet.
  • Figure 4: Random projections of hidden activations exhibit varying amounts of smoothness. a. Simple moving dot stimulus generated from the $\sin$ and $\cos$ functions (top), converted into a stack of frames (middle), and passed into the network to yield hidden activations (bottom). b. Sample of random projections of hidden activations produced from the moving dot stimulus. Color of box indicates layer of the network, and color of the random projection indicates time. Inset text shows average activation. c. Average action of random projections as a function of network depth. Average computed over 1000 random projections (see Methods). d. Average action of random projections as a function of the estimated power law exponent $\alpha$.
  • Figure 5: Higher noise induces more robust codes. Log-log plot of eigenvalues as a function of principal components for hidden activations for two different networks with different dropout rates. The increase in non-linearity indicates a learnt code which is more robust to noise (which dropout introduces).
  • ...and 2 more figures
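
The random-projection analysis described in the Figure 4 caption can be illustrated with a short sketch. Everything below is an assumption made for illustration: the captions do not define "action", so the hypothetical function `mean_projection_action` uses the sum of squared step lengths along a scale-normalized projected trajectory as a stand-in; the projections are assumed to be Gaussian and two-dimensional; and the activations are assumed to be a matrix with one row per stimulus frame. Under this proxy, a smooth code yields a small value and a jagged code yields a large one.

```python
import numpy as np


def mean_projection_action(activations, n_projections=1000, out_dim=2, seed=0):
    """Average roughness of random low-dimensional projections.

    activations: array of shape (n_frames, n_neurons), ordered in time.
    "Action" here is an assumed proxy: the sum of squared steps of the
    projected path after normalizing the path to unit RMS radius.
    """
    rng = np.random.default_rng(seed)
    n_neurons = activations.shape[1]
    scores = np.empty(n_projections)
    for k in range(n_projections):
        # Gaussian random projection (an assumption of this sketch).
        proj = rng.standard_normal((n_neurons, out_dim)) / np.sqrt(n_neurons)
        path = activations @ proj
        path = path - path.mean(axis=0)
        # Normalize scale so projections of different size are comparable.
        scale = np.sqrt((path ** 2).sum(axis=1).mean())
        if scale > 0:
            path = path / scale
        steps = np.diff(path, axis=0)
        scores[k] = (steps ** 2).sum()
    return scores.mean()


if __name__ == "__main__":
    # Toy stand-in for hidden activations driven by a moving-dot stimulus:
    # a circle traced by cos and sin, embedded linearly in 64 dimensions.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
    embed = rng.standard_normal((2, 64))
    smooth = np.stack([np.cos(t), np.sin(t)], axis=1) @ embed
    # Shuffling the frames destroys temporal smoothness.
    rough = smooth[rng.permutation(len(t))]
    print("smooth code:", mean_projection_action(smooth, n_projections=100))
    print("shuffled code:", mean_projection_action(rough, n_projections=100))
```

The toy comparison only checks that the proxy separates a smooth trajectory from a shuffled one; applying it to real network layers would require extracting the hidden activations for each frame first.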