Linear Explanations for Individual Neurons

Tuomas Oikarinen; Tsui-Wei Weng

TL;DR

The paper tackles the challenge of interpreting individual neurons by showing that explanations based only on a neuron's very highest activations miss most of its causal influence. It introduces Linear Explanations (LE), which model a neuron's activation as $s(x) = \sum_k w_k \mathbb{P}(c_k|x)$, where the concept probabilities come from a concept activation matrix $P$ built from labels or SigLIP. The learning pipeline combines learning a relatively sparse $w_k$ with a greedy search. The paper also develops a vision-adapted simulation framework that uses SigLIP to evaluate explanations through correlation scoring ($\rho$) and ablation scoring ($\alpha$), and reports that LE, particularly LE(SigLIP), achieves substantially higher fidelity to actual neuron behavior than prior methods. The result is a scalable, quantitative approach to mechanistic interpretability across CNNs and Vision Transformers.
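As a minimal sketch of the formula above, the simulated activation for every input is a matrix-vector product between the concept activation matrix $P$ and the weight vector $w$. The matrix entries, the weight values, and the function name `linear_explanation` are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def linear_explanation(P, w):
    """Return s(x_i) = sum_k w_k * P[i, k] for every input i."""
    return P @ w

# Toy concept activation matrix (hypothetical values):
# 4 inputs x 3 concepts, entry [i, k] plays the role of P(c_k | x_i).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
])

# Sparse weight vector (hypothetical): only concepts 0 and 1 contribute.
w = np.array([2.0, 0.5, 0.0])

s = linear_explanation(P, w)
print(s)
```

Because the third weight is zero, the third concept has no effect on the simulated activation, which is what makes a sparse $w$ readable as an explanation: only the concepts with nonzero weight need to be listed.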

Abstract

In recent years many methods have been developed to understand the internal workings of neural networks, often by describing the function of individual neurons in the model. However, these methods typically focus only on explaining the very highest activations of a neuron. In this paper we show this is not sufficient, and that the highest activation range is responsible for only a very small percentage of the neuron's causal effect. In addition, inputs causing lower activations are often very different and can't be reliably predicted by looking only at high activations. We propose that neurons should instead be understood as a linear combination of concepts, and develop an efficient method for producing these linear explanations. In addition, we show how to automatically evaluate description quality using simulation, i.e., predicting neuron activations on unseen inputs, in the vision setting.
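The idea of evaluating an explanation by simulation can be sketched as follows: fit weights on one set of inputs, predict activations on held-out inputs, and score the prediction by its Pearson correlation with the actual activations. Everything here is an assumption made for illustration: the "neuron" is synthetic, the concept matrix is random, and ordinary least squares stands in for the paper's actual learning procedure.

```python
import numpy as np

def correlation_score(simulated, actual):
    """Pearson correlation between simulated and actual activations."""
    return float(np.corrcoef(simulated, actual)[0, 1])

rng = np.random.default_rng(0)
n_inputs, n_concepts = 200, 5

# Synthetic concept activation matrix and a synthetic "neuron" whose
# activation is a noisy linear combination of two of the five concepts.
P = rng.random((n_inputs, n_concepts))
true_w = np.array([1.5, 0.0, 0.7, 0.0, 0.0])
actual = P @ true_w + 0.05 * rng.standard_normal(n_inputs)

# Fit on the first 150 inputs, evaluate on the remaining 50 unseen inputs.
train, test = slice(0, 150), slice(150, n_inputs)
w_fit, *_ = np.linalg.lstsq(P[train], actual[train], rcond=None)

rho = correlation_score(P[test] @ w_fit, actual[test])
print(f"held-out correlation: {rho:.3f}")
```

Scoring on held-out inputs, rather than on the inputs used for fitting, is what makes this a test of whether the explanation predicts the neuron's behavior instead of merely describing the data it was fit to.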

Paper Structure (40 sections, 20 equations, 15 figures, 15 tables, 1 algorithm)

Introduction
Motivation: How Important are Different Parts of a Neuron's Activation Pattern?
Definitions
Results
Method
Constructing a concept activation matrix
Learning Linear Explanations
Learn a relatively sparse $w_k$
Greedy search
Improving Evaluation of Explanations via Simulation
Experiment Results
Setup
Qualitative results
Simulation results: Correlation Scoring
Simulation results: Ablation scoring
...and 25 more sections

Figures (15)

Figure 1: Overview of our proposed method: Linear Explanations.
Figure 2: An overview of the simulation pipeline with correlation scoring.
Figure 3: An area chart of the activations of two neurons in layer4 of ResNet-50. Neuron 140 is mostly monosemantic, but represents different types of birds at different activation ranges. In contrast, Neuron 136 has two distinct roles: snow- and skiing-related concepts at high activations and dog-like animals at lower activations.
Figure 4: Descriptions and highly activating images from different activation ranges of example neurons. In both cases, Linear Explanations provide a more complete description than the baselines.
Figure 5: The relationship between correlation and ablation score of different explanations.
...and 10 more figures
