PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits
Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, Sebastian Lapuschkin
TL;DR
This work tackles the challenge of polysemantic neurons hindering interpretability in deep networks. It introduces PURE, a post-hoc approach that identifies the active circuit behind each semantic concept of a neuron by running a partial backward pass with relevance messages, and then clusters the resulting lower-layer attributions to create monosemantic virtual neurons. Applied to ResNet models on ImageNet, PURE yields purer feature representations as measured by CLIP-based similarity, approaching the performance of DINOv2 baselines and outperforming activation-based disentanglement. The method provides a circuit-grounded pathway to more reliable concept discovery and model correction, with code available for reproducibility and broader adoption in mechanistic interpretability research.
Abstract
The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, can act polysemantically and encode multiple (unrelated) features, which renders their interpretation difficult. We present a method for disentangling polysemanticity in any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. This is achieved by identifying the relevant sub-graph ("circuit") for each "pure" feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. Evaluating feature visualizations with CLIP, we show that our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at https://github.com/maxdreyer/PURE.
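
The core idea of splitting one neuron by its circuits can be illustrated with a toy sketch. This is not the code from the repository: it assumes a single linear layer, uses activation times weight as a simple stand-in for the relevance messages, and uses a small hand-written k-means. All names (`lower_layer_attributions`, `kmeans`, the synthetic data) are hypothetical and chosen for illustration only.

```python
import numpy as np


def lower_layer_attributions(acts, weights):
    """Attribution of each lower-layer neuron to one upper-layer neuron.

    Assumption: activation * weight, used here as a simple stand-in
    for the relevance messages of a partial backward pass.
    """
    return acts * weights  # shape: (n_samples, n_lower)


def kmeans(x, k, n_iter=20):
    """Minimal k-means with deterministic farthest-point initialisation."""
    centroids = [x[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centroids], axis=0)
        centroids.append(x[np.argmax(d)])
    centroids = np.stack(centroids)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return labels


# Synthetic reference samples for one polysemantic neuron: group A drives it
# through lower-layer neurons 0-2, group B through lower-layer neurons 3-5.
rng = np.random.default_rng(0)
n_per_group, n_lower = 20, 6
weights = np.ones(n_lower)
acts_a = np.zeros((n_per_group, n_lower))
acts_a[:, :3] = rng.uniform(0.8, 1.2, (n_per_group, 3))
acts_b = np.zeros((n_per_group, n_lower))
acts_b[:, 3:] = rng.uniform(0.8, 1.2, (n_per_group, 3))
acts = np.vstack([acts_a, acts_b])

# The upper neuron fires similarly for both groups, so its activation alone
# cannot tell them apart.
upper_act = np.maximum(acts @ weights, 0.0)

# Cluster the (normalised) lower-layer attributions instead.
attributions = lower_layer_attributions(acts, weights)
attributions = attributions / np.abs(attributions).sum(axis=1, keepdims=True)
labels = kmeans(attributions, k=2)

# Each cluster defines one "virtual" neuron that keeps the original activation
# only for the samples assigned to its circuit.
virtual_acts = upper_act[:, None] * np.eye(2)[labels]
```

In this toy setting the two groups are indistinguishable from the upper neuron's activation, but they separate cleanly in attribution space, which is the intuition behind assigning each reference sample to a virtual neuron.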
