Linear Explanations for Individual Neurons
Tuomas Oikarinen, Tsui-Wei Weng
TL;DR
The paper addresses the interpretation of individual neurons, showing that explanations based only on the very highest activations miss most of a neuron's causal influence. It introduces Linear Explanations (LE), which model a neuron's activation as $s(x) = \sum_k w_k \mathbb{P}(c_k|x)$, using a concept activation matrix $P$ built from labels or SigLIP and a learning pipeline that combines sparse weights $w_k$ with a greedy search. It also develops a simulation framework adapted to vision, using SigLIP to evaluate explanations through correlation $\rho$ and ablation $\alpha$, and reports that LE, particularly LE(SigLIP), achieves substantially higher fidelity to actual neuron behavior than prior methods. The result is a scalable, quantitative approach to mechanistic interpretability across CNNs and Vision Transformers, intended to enable more trustworthy and comprehensive neuron-level explanations.
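The linear model $s(x) = \sum_k w_k \mathbb{P}(c_k|x)$ can be illustrated with a minimal sketch. It assumes a concept activation matrix `P` whose entry `P[i, k]` approximates $\mathbb{P}(c_k|x_i)$ and a vector `a` of recorded neuron activations; both are synthetic here. The greedy forward selection with least-squares refitting is a generic stand-in for a sparse, greedy fit, not the authors' actual procedure, and the function name `greedy_linear_explanation` is hypothetical.

```python
import numpy as np

def greedy_linear_explanation(P, a, max_concepts=3):
    """Illustrative greedy forward selection (not the paper's exact pipeline).

    P: (n_inputs, n_concepts) matrix, P[i, k] ~ P(c_k | x_i).
    a: (n_inputs,) observed neuron activations.
    Returns (selected concept indices, least-squares weights for them).
    """
    selected, weights = [], np.zeros(0)
    for _ in range(max_concepts):
        best_k, best_err, best_w = None, np.inf, None
        for k in range(P.shape[1]):
            if k in selected:
                continue
            cols = selected + [k]
            # Refit all weights for the candidate concept set.
            w, *_ = np.linalg.lstsq(P[:, cols], a, rcond=None)
            err = np.sum((P[:, cols] @ w - a) ** 2)
            if err < best_err:
                best_k, best_err, best_w = k, err, w
        selected.append(best_k)
        weights = best_w
    return selected, weights

# Synthetic toy data (an assumption for illustration): the "neuron" is an
# exact linear combination of concepts 2 and 7.
rng = np.random.default_rng(0)
P = rng.random((200, 10))
true_w = np.zeros(10)
true_w[2], true_w[7] = 1.5, 0.5
a = P @ true_w

selected, weights = greedy_linear_explanation(P, a, max_concepts=2)
simulated = P[:, selected] @ weights  # s(x) for every input
```

Capping `max_concepts` is what keeps the explanation sparse in this sketch; a real pipeline might instead use a sparsity penalty on $w_k$.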
Abstract
In recent years, many methods have been developed to understand the internal workings of neural networks, often by describing the function of individual neurons in the model. However, these methods typically focus only on explaining the very highest activations of a neuron. In this paper we show this is not sufficient, and that the highest activation range is only responsible for a very small percentage of the neuron's causal effect. In addition, inputs causing lower activations are often very different and can't be reliably predicted by only looking at high activations. We propose that neurons should instead be understood as a linear combination of concepts, and develop an efficient method for producing these linear explanations. In addition, we show how to automatically evaluate description quality using simulation, i.e. predicting neuron activations on unseen inputs in the vision setting.
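Evaluation by simulation, as described above, amounts to predicting activations on unseen inputs and comparing them with the real ones. The sketch below shows one plausible way to score this with a Pearson correlation. It is an assumption-laden toy: the concept matrices are random rather than produced by a vision model, the neuron is synthetic, and the helper `correlation_score` is a hypothetical name. It compares a linear explanation with a single-concept explanation on held-out inputs.

```python
import numpy as np

def correlation_score(simulated, actual):
    """Pearson correlation between simulated and actual activations."""
    return float(np.corrcoef(simulated, actual)[0, 1])

# Synthetic setup (illustrative only): 5 concepts, a neuron driven by two.
rng = np.random.default_rng(1)
n_train, n_test, n_concepts = 300, 100, 5
P_train = rng.random((n_train, n_concepts))
P_test = rng.random((n_test, n_concepts))   # unseen inputs
w_true = np.array([1.0, 0.0, 0.5, 0.0, 0.0])
a_train = P_train @ w_true + 0.05 * rng.standard_normal(n_train)
a_test = P_test @ w_true + 0.05 * rng.standard_normal(n_test)

# Linear explanation: weights fit on training inputs only.
w_fit, *_ = np.linalg.lstsq(P_train, a_train, rcond=None)
rho_linear = correlation_score(P_test @ w_fit, a_test)

# Single-concept explanation: keep only the concept with the largest weight.
top = int(np.argmax(np.abs(w_fit)))
rho_single = correlation_score(P_test[:, top], a_test)

print(f"linear: {rho_linear:.3f}, single concept: {rho_single:.3f}")
```

In this toy setting the single-concept description ignores the weaker second concept, so its simulated activations track the neuron less closely than the linear combination does.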
