PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits

Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, Sebastian Lapuschkin

TL;DR

This work tackles the challenge of polysemantic neurons hindering interpretability in deep networks. It introduces PURE, a post-hoc approach that identifies active circuits behind each semantic by leveraging a partial backward pass with relevance messages and then clusters lower-layer attributions to create monosemantic virtual neurons. Applied to ResNet models on ImageNet, PURE yields more purified feature representations as measured by CLIP-based similarity, approaching the performance of DINOv2 baselines and outperforming activation-based disentanglement. The method provides a circuit-grounded pathway to more reliable concept discovery and model correction, with code available for reproducibility and broader adoption in mechanistic interpretability research.

Abstract

The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, have the capability to act polysemantically and encode for multiple (unrelated) features, which renders their interpretation difficult. We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. This is achieved by identifying the relevant sub-graph ("circuit") for each "pure" feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. While evaluating feature visualizations using CLIP, our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at https://github.com/maxdreyer/PURE.

Paper Structure

This paper contains 17 sections, 5 equations, and 14 figures.

Figures (14)

  • Figure 1: Distinct circuits exist for each feature of a polysemantic neuron. With PURE, we propose to split a polysemantic neuron into multiple pure "virtual" ones, one for each circuit. Here, we disentangle the maximally activating sample (patches) of neuron 2 into its two pure features: "hairy dog" (2a) and "maze" (2b).
  • Figure 2: PURE detects circuits using lower-level neuron attributions for the $n_\text{ref}$ most activating input samples in a preprocessing step. For polysemantic neurons, we assume a distinct active circuit for each semantic, and these circuits are found by clustering the attributions with $k$-means. During the test phase, the active circuit can be assigned post-hoc to any new test sample by identifying the closest circuit.
  • Figure 3: Applying PURE to neurons with varying degrees of polysemanticity: We show UMAP embeddings with the maximally activating image patches, and the resulting reference sets before and after purification when identifying two circuits via $k$-means.
  • Figure 4: PURE leads to more interpretable representations as measured via CLIP embedding distances on feature visualizations (top left), thereby improving upon activation-based clustering and almost reaching DINOv2 scores. (Bottom left): Distances between CLIP embeddings of feature visualizations correlate significantly more strongly with PURE embeddings than with activations. (Right): Activations tend to overestimate distances when unrelated features vary.
  • Figure A.1: Feature visualizations for neurons #1028 (top), #1029 (middle) and #1030 (bottom) for full, cropped-only, and cropped-and-masked reference samples. Cropping visibly improves the visualizations by removing distracting image parts that are irrelevant to a neuron's semantics.
  • ...and 9 more figures
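
The two stages described in the Figure 2 caption (clustering lower-level attributions of the reference samples with $k$-means, then assigning a new sample to the closest circuit) can be illustrated with a small sketch. This is not the authors' implementation: the attribution vectors below are hand-written toy values rather than relevance scores from a partial backward pass, the function names (`farthest_first_init`, `kmeans`, `closest_circuit`) are hypothetical, and the deterministic farthest-first initialization is an assumption made to keep the example reproducible and dependency-free.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two attribution vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def closest_circuit(attribution, centroids):
    """Index of the circuit centroid nearest to an attribution vector."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(attribution, centroids[c]))


def farthest_first_init(vectors, k):
    """Deterministic initialization (an assumption of this sketch):
    start from the first vector, then repeatedly add the vector that is
    farthest from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, n_iter=100):
    """Plain k-means over attribution vectors; returns (centroids, labels)."""
    centroids = farthest_first_init(vectors, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [closest_circuit(v, centroids) for v in vectors]
        if new_labels == labels:  # assignments stable, so we have converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return centroids, labels


# Toy stand-in for lower-level attributions of one neuron's n_ref = 6 most
# activating samples over 4 lower-layer neurons (values are made up).
reference_attributions = [
    [0.9, 0.8, 0.0, 0.1],
    [1.0, 0.7, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.8, 1.0],
]

# Preprocessing: find k = 2 circuits, one per "virtual" neuron.
centroids, labels = kmeans(reference_attributions, k=2)

# Test phase: assign a new sample's attribution vector to the closest circuit.
new_sample_attribution = [0.05, 0.0, 0.7, 0.9]
virtual_neuron = closest_circuit(new_sample_attribution, centroids)
print(labels, virtual_neuron)
```

In this toy setting the first three reference samples fall into one circuit and the last three into the other, and the new sample is assigned to the second. For real models, the attribution vectors would come from the relevance-based backward pass described in the paper, and a library $k$-means would be the more natural choice.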