Understanding polysemanticity in neural networks through coding theory
Simon C. Marshall, Jan H. Kirchner
TL;DR
The paper tackles interpretability by addressing polysemanticity in neural networks and proposing a coding-theory lens that treats activations as a coded representation. It uses the eigenspectrum of the activation covariance and random projections to quantify redundancy, approximate channel capacity, and assess whether the code is smooth or non-differentiable, with a power-law exponent $α$ characterizing the decay. Empirical results on a ResNet-based autoencoder and on GPT-2 XL show depth-dependent $α$ values and that dropout promotes redundancy and error-correcting structure, while random projections provide complementary explanations to single-neuron probes. This top-down framework offers new avenues for circuit-level interpretability and robustness in large-scale models.
Abstract
Despite substantial efforts, neural network interpretability remains an elusive goal, with previous research failing to provide succinct explanations of most single neurons' impact on the network output. This limitation is due to the polysemantic nature of most neurons, whereby a given neuron is involved in multiple unrelated network states, complicating the interpretation of that neuron. In this paper, we apply tools developed in neuroscience and information theory to propose both a novel practical approach to network interpretability and theoretical insights into polysemanticity and the density of codes. We infer levels of redundancy in the network's code by inspecting the eigenspectrum of the activations' covariance matrix. Furthermore, we show how random projections can reveal whether a network exhibits a smooth or non-differentiable code and hence how interpretable the code is. The same framework explains the advantages of polysemantic neurons for learning performance and accounts for trends found in recent results by Elhage et al. (2022). Our approach advances the pursuit of interpretability in neural networks, providing insights into their underlying structure and suggesting new avenues for circuit-level interpretability.
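As an illustration of the eigenspectrum analysis described above, the following is a minimal sketch, not the paper's implementation. It assumes activations are collected as a matrix of shape (samples, units) and that the sorted covariance eigenvalues decay roughly as $\lambda_n \propto n^{-\alpha}$, so that $\alpha$ can be estimated by a least-squares line fit in log-log space. The function names, the fitting range, and the synthetic data are hypothetical choices made for this example.

```python
import numpy as np

def covariance_eigenspectrum(activations):
    """Eigenvalues of the activation covariance matrix, sorted in descending order."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (activations.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return np.clip(eigvals, 0.0, None)       # remove tiny negative values from round-off

def fit_power_law_exponent(eigvals, lo=1, hi=None):
    """Estimate alpha in lambda_n ~ n^(-alpha) over ranks lo..hi (1-indexed, inclusive)."""
    hi = len(eigvals) if hi is None else hi
    ranks = np.arange(lo, hi + 1)
    vals = np.asarray(eigvals)[lo - 1:hi]
    keep = vals > 0                           # log is undefined for zero eigenvalues
    slope, _ = np.polyfit(np.log(ranks[keep]), np.log(vals[keep]), 1)
    return -slope

# Synthetic stand-in for recorded activations (assumption: not real network data).
rng = np.random.default_rng(0)
n_samples, n_units = 2000, 50
true_alpha = 1.0
scales = np.arange(1, n_units + 1) ** (-true_alpha / 2)  # variance of unit n is n^(-alpha)
activations = rng.standard_normal((n_samples, n_units)) * scales

alpha_hat = fit_power_law_exponent(covariance_eigenspectrum(activations), lo=1, hi=30)
print(f"estimated alpha: {alpha_hat:.2f}")
```

In this sketch a larger estimated $\alpha$ means variance is concentrated in fewer directions. The estimate is sensitive to the rank range used for the fit and to the number of samples, so it should be read as a rough diagnostic only.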
