Disentangling Polysemantic Channels in Convolutional Neural Networks

Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth

TL;DR

The paper tackles the interpretability challenge posed by polysemantic channels in CNNs. It introduces a method to identify channels that are relevant to multiple classes and to disentangle each such polysemantic channel into two monosemantic channels plus a residual by inserting an intermediate layer and re-wiring connections based on average relevance. The approach preserves performance while enabling clearer mechanistic interpretability and more faithful feature visualizations. Experiments on ImageNet with ResNet-50 demonstrate both qualitative and quantitative disentanglement of a representative polysemantic channel, illustrating improved concept isolation. This work advances mechanistic interpretability by moving from post-hoc visualizations of polysemantic channels to explicit, network-internal disentanglement of concepts.

Abstract

Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, exploiting the fact that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
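
The re-wiring idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a single channel whose pre-activation is a plain dot product with the previous layer, ignores the nonlinearity, and uses a hypothetical assignment rule (each incoming weight goes to whichever class has the larger average relevance, provided it exceeds a threshold). The names `split_channel`, `rel_a`, `rel_b`, and `tau` are illustrative and do not come from the paper.

```python
import numpy as np

def split_channel(w, rel_a, rel_b, tau=0.0):
    """Split one channel's incoming weights into (w_a, w_b, w_res).

    w      : incoming weights, one per previous-layer channel
    rel_a  : assumed average relevance of each previous-layer channel for class A
    rel_b  : the same for class B
    tau    : hypothetical relevance threshold; weaker inputs go to the residual
    """
    w = np.asarray(w, dtype=float)
    rel_a = np.asarray(rel_a, dtype=float)
    rel_b = np.asarray(rel_b, dtype=float)
    # Assumed rule: an input belongs to the class with strictly larger relevance.
    to_a = (rel_a > rel_b) & (rel_a > tau)
    to_b = (rel_b > rel_a) & (rel_b > tau)
    w_a = np.where(to_a, w, 0.0)
    w_b = np.where(to_b, w, 0.0)
    # Everything not assigned to A or B stays in the residual channel.
    w_res = w - w_a - w_b
    return w_a, w_b, w_res

rng = np.random.default_rng(0)
w = rng.normal(size=6)
rel_a = np.array([0.9, 0.8, 0.0, 0.0, 0.1, 0.0])
rel_b = np.array([0.0, 0.1, 0.7, 0.6, 0.1, 0.0])
w_a, w_b, w_res = split_channel(w, rel_a, rel_b, tau=0.05)

# Summing the three new channels recovers the original pre-activation.
x = rng.normal(size=6)
assert np.isclose(x @ w_a + x @ w_b + x @ w_res, x @ w)
```

Because the three weight vectors sum to the original one, this linear toy preserves the channel's pre-activation exactly; how the method handles nonlinearities and the inserted intermediate layer is described in the paper itself, not in this sketch.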

Paper Structure

This paper contains 12 sections, 6 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Illustration of our proposed disentanglement approach. We show two input samples from two different classes (left), the original neural network (middle), and our disentangled neural network (right). The bar plots in each channel indicate how active that channel is for each of the two input images. The color of the edge indicates if that edge is mainly propagating information from the first image, from the second image, or from both/none. The images in the nodes represent the encoded concepts.
  • Figure 2: Density plot of the activations for the original channel $\#1660$ and the corresponding disentangled channels for images from the classes "digital clock" and "cauliflower". In both plots, one disentangled channel mimics the activation of the original channel $\#1660$ while the other disentangled channel is mostly inactive, indicating that the disentanglement was successful.
  • Figure 3: WordNet similarity of the two classes for which a channel is relevant over their ARV cosine similarity (see Definition 3.1).

Theorems & Definitions (1)

  • Definition 3.1: $\gamma$-polysemanticity