What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes

Victor Lecomte; Kushal Thaman; Rylan Schaeffer; Naomi Bashkansky; Trevor Chow; Sanmi Koyejo

TL;DR

The paper investigates polysemanticity as a barrier to interpretability in task-optimized networks and proposes incidental polysemanticity as an origin that arises even when the task does not force it. It develops two toy-model mechanisms, $\ell_1$ regularization-induced sparsity and hidden-layer noise with negative excess kurtosis, to show how random initialization and training dynamics can create winner-take-all effects that produce polysemantic neurons. The authors derive an analytical estimate of how fast the weights sparsify, characterize how interference resolves collisions between features, and validate the predictions with numerical simulations, which show the amount of polysemanticity scaling as $\Theta(n^2/m)$ for $n$ features and $m$ hidden neurons. They further compare the two incidental mechanisms and discuss implications for mechanistic interpretability and AI safety, calling for quantitative study of the performance-polysemanticity tradeoff and of potential mitigation strategies. Overall, the work broadens the origin story of polysemanticity and highlights practical considerations for avoiding or diagnosing incidental polysemantic representations in real networks.

Abstract

Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle to the interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more "features" than neurons, such that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand networks' internal processing. In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, a phenomenon we term incidental polysemanticity. Using a combination of theory and experiments, we show that incidental polysemanticity can arise for multiple reasons, including regularization and neural noise. It occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Our paper concludes by calling for further research quantifying the performance-polysemanticity tradeoff in task-optimized deep neural networks to better understand to what extent polysemanticity is avoidable.
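The chance overlap described above can be illustrated with a birthday-paradox style count. The sketch below is a simplified model, not the paper's experiment: it assumes each of the $n$ features independently picks its initially strongest neuron uniformly at random among $m$ neurons, and it counts pairs of features that land on the same neuron. The function names are hypothetical. Under this assumption the expected number of colliding pairs is $n(n-1)/(2m)$, which matches the $\Theta(n^2/m)$ scaling quoted in the TL;DR up to a constant factor.

```python
import numpy as np


def count_colliding_pairs(n, m, rng):
    """Count feature pairs that share a neuron under a uniform random assignment.

    Assumption (not from the paper): each feature picks one neuron
    uniformly and independently of the other features.
    """
    assignment = rng.integers(0, m, size=n)          # neuron index per feature
    counts = np.bincount(assignment, minlength=m)    # features per neuron
    # A neuron holding k features contributes k * (k - 1) / 2 colliding pairs.
    return int(np.sum(counts * (counts - 1) // 2))


def expected_colliding_pairs(n, m):
    """Closed-form expectation: C(n, 2) pairs, each colliding with probability 1/m."""
    return n * (n - 1) / (2 * m)


def mean_colliding_pairs(n, m, trials, seed=0):
    """Monte Carlo average of the collision count over independent trials."""
    rng = np.random.default_rng(seed)
    return float(np.mean([count_colliding_pairs(n, m, rng) for _ in range(trials)]))


if __name__ == "__main__":
    n = 256
    for m in (256, 1024, 4096):
        simulated = mean_colliding_pairs(n, m, trials=500)
        print(m, simulated, expected_colliding_pairs(n, m))
```

Quadrupling $m$ at fixed $n$ should cut the simulated count to roughly a quarter. Collisions become rarer as the hidden layer grows, but they do not vanish merely because $m \ge n$.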

Paper Structure (27 sections, 38 equations, 7 figures)

This paper contains 27 sections, 38 equations, 7 figures.

Introduction
An alternative origin story
Incidental polysemanticity from regularization
Network and data
Possible solutions
Learning dynamics and loss
The winning neuron takes it all
Sparsity force
How fast does it sparsify?
Numerical simulations
Interference arbiters collisions between features
How strong is the interference?
Benign and malign collisions
Experiments:
Another incentive for sparsity: noise in the hidden layer
...and 12 more sections

Figures (7)

Figure 1: A visualization of the non-linear autoencoder setup with tied weights $W \in \mathbb{R}^{n\times m}$, a single hidden layer of size $m$, $\ell_1$ regularization with parameter $\lambda$, and a $\mathrm{ReLU}$ on the output layer.
Figure 2: Number of non-zero coordinates $m'$ in $W_i$ and the value of $\|W_i\|_1$ plotted against training steps. The simulation confirms the hypothesized speed of sparsification.
Figure 3: Number of polysemantic neurons against the number of neurons in the hidden layer for $16$ different training runs of the non-linear autoencoder with $n=256$.
Figure 4: Sparsification process under bipolar and normal noise of various magnitudes. The line $3/m$ is added as a reference since for large $m$ it is asymptotic to the fourth norm of a random unit vector.
Figure 5: Final fourth norms under $\ell_1$ regularization and bipolar noise of various magnitudes. The line $3/m$ is added as a reference since for large $m$ it is asymptotic to the fourth norm of a random unit vector.
...and 2 more figures
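
The setup in Figure 1 and the $3/m$ reference line in Figures 4 and 5 can be sketched numerically. This is a minimal sketch under stated assumptions, not the authors' code. It assumes the inputs are the $n$ standard basis vectors, the reconstruction error is a sum of squares, and "fourth norm" refers to the sum of fourth powers $\|W_i\|_4^4$ of a row $W_i$. For a uniformly random unit vector in $\mathbb{R}^m$ that quantity has expectation $3/(m+2)$, which approaches $3/m$ for large $m$. All function names are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def autoencoder_loss(W, lam):
    """Tied-weight autoencoder loss with an l1 penalty on W.

    W has shape (n, m): n input features, m hidden neurons.
    Assumptions (not confirmed by the captions): the inputs are the n
    standard basis vectors and the reconstruction error is squared.
    """
    n = W.shape[0]
    X = np.eye(n)                   # one standard basis vector per row
    hidden = X @ W                  # encode with W, shape (n, m)
    output = relu(hidden @ W.T)     # decode with the same W, ReLU on the output
    reconstruction = np.sum((output - X) ** 2)
    penalty = lam * np.sum(np.abs(W))
    return float(reconstruction + penalty)


def random_unit_rows(n, m, rng):
    """Draw n rows uniformly from the unit sphere in R^m (normalized Gaussians)."""
    W = rng.standard_normal((n, m))
    return W / np.linalg.norm(W, axis=1, keepdims=True)


def fourth_power_sums(W):
    """Per-row sum of fourth powers, i.e. ||W_i||_4^4 for each row W_i."""
    return np.sum(W ** 4, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 256, 512
    W0 = random_unit_rows(n, m, rng)
    print("loss at random init:", autoencoder_loss(W0, lam=0.01))
    print("mean fourth-power sum:", fourth_power_sums(W0).mean(), "vs 3/m =", 3 / m)
```

A fully sparse row with a single non-zero entry of magnitude 1 has a fourth-power sum of exactly 1, so the gap between $3/m$ and 1 gives a rough scale for how far a row has sparsified.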
