Disentangling Polysemantic Channels in Convolutional Neural Networks
Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth
TL;DR
The paper tackles the interpretability challenge posed by polysemantic channels in CNNs. It introduces a method to identify channels that are relevant to multiple classes and to disentangle each such polysemantic channel into two monosemantic channels plus a residual by inserting an intermediate layer and re-wiring connections based on average relevance. The approach preserves performance while enabling clearer mechanistic interpretability and more faithful feature visualizations. Experiments on ImageNet with ResNet-50 demonstrate both qualitative and quantitative disentanglement of a representative polysemantic channel, illustrating improved concept isolation. This work advances mechanistic interpretability by moving from post-hoc visualizations of polysemantic channels to explicit, network-internal disentanglement of concepts.
Abstract
Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, exploiting the fact that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
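The weight restructuring can be illustrated with a minimal sketch. It is not the authors' algorithm: it only shows why a channel's incoming weights can be partitioned by previous-layer channel without changing the channel's linear pre-activation. The function name `split_channel`, the hand-picked masks `mask_a` and `mask_b`, and the 1x1-convolution setting (one weight per input channel) are assumptions made for illustration. In the paper the grouping is derived from average relevance, and the handling of the nonlinearity and the following layer is not covered here.

```python
import numpy as np

def split_channel(w, mask_a, mask_b):
    """Partition one channel's input weights into concept A, concept B, and a residual.

    w       : 1-D array, one weight per previous-layer channel (1x1 conv assumed).
    mask_a  : boolean mask of input channels assumed to drive concept A.
    mask_b  : boolean mask of input channels assumed to drive concept B.
    """
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    if np.any(mask_a & mask_b):
        raise ValueError("masks must be disjoint")
    w_a = np.where(mask_a, w, 0.0)                 # weights kept for concept A
    w_b = np.where(mask_b, w, 0.0)                 # weights kept for concept B
    w_res = np.where(~(mask_a | mask_b), w, 0.0)   # everything else
    return w_a, w_b, w_res

# Toy data: six previous-layer channels, hypothetical concept assignment.
rng = np.random.default_rng(0)
w = rng.normal(size=6)
x = rng.normal(size=6)
mask_a = [True, True, False, False, False, False]
mask_b = [False, False, True, True, False, False]

w_a, w_b, w_res = split_channel(w, mask_a, mask_b)

# The original linear pre-activation equals the sum of the three parts.
z = w @ x
z_parts = w_a @ x + w_b @ x + w_res @ x
```

Because the three weight vectors sum to `w`, the split preserves the channel's linear response at every spatial location; each part can then be inspected or visualized separately.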
