Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability

Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

TL;DR

This work provides a theoretical foundation for neuron identification in mechanistic interpretability by treating it as an inverse learning problem. It derives a generalization gap bound and high-probability faithfulness guarantees for concept-neuron similarity estimated on a probing dataset, and introduces a bootstrap-based stability analysis together with the Bootstrap Explanation (BE) method for constructing concept prediction sets with coverage guarantees. Through synthetic and real-data experiments, the paper validates the convergence properties of common similarity metrics and shows how probing data size and concept frequency affect faithfulness and stability. The resulting framework makes explanations of individual neurons more trustworthy and quantifiable.

Abstract

Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. While algorithms such as Network Dissection and CLIP-Dissect achieve great empirical success, they still lack the rigorous theoretical foundation needed for trustworthy and reliable explanations. In this work, we observe that neuron identification can be viewed as the inverse process of machine learning, which allows us to derive guarantees for neuron explanations. Based on this insight, we present the first theoretical analysis of two fundamental challenges: (1) Faithfulness: whether the identified concept faithfully represents the neuron's underlying function, and (2) Stability: whether the identification results are consistent across probing datasets. We derive generalization bounds for widely used similarity metrics (e.g., accuracy, AUROC, IoU) to guarantee faithfulness, and propose a bootstrap ensemble procedure that quantifies stability, together with the BE (Bootstrap Explanation) method, which generates concept prediction sets with a guaranteed coverage probability. Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method, providing an important step toward trustworthy neuron identification.
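To make the notion of concept-neuron similarity concrete, here is a minimal sketch of how accuracy and IoU might be estimated on a probing set. It assumes the neuron is binarized with a fixed activation threshold and that each concept comes with binary labels; the function names, the threshold, and the toy data are all hypothetical and are not taken from the paper.

```python
def similarity_scores(neuron_on, concept_on):
    """Accuracy and IoU between two binary sequences over the probing set."""
    pairs = [(bool(n), bool(c)) for n, c in zip(neuron_on, concept_on)]
    agree = sum(1 for n, c in pairs if n == c)
    inter = sum(1 for n, c in pairs if n and c)
    union = sum(1 for n, c in pairs if n or c)
    return {
        "accuracy": agree / len(pairs),
        # IoU is undefined when neither fires; 0.0 is an arbitrary convention.
        "iou": inter / union if union else 0.0,
    }


def identify_concept(activations, concept_labels, threshold, metric="iou"):
    """Pick the concept whose labels best match the thresholded neuron."""
    neuron_on = [a > threshold for a in activations]  # assumed binarization
    scores = {
        name: similarity_scores(neuron_on, labels)[metric]
        for name, labels in concept_labels.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy probing set: six inputs, two candidate concepts (illustrative only).
activations = [0.9, 0.8, 0.1, 0.2, 0.7, 0.05]
concept_labels = {
    "dog": [1, 1, 0, 0, 1, 0],
    "grass": [1, 0, 1, 0, 0, 0],
}
best, scores = identify_concept(activations, concept_labels, threshold=0.5)
```

Because these scores are computed on a finite probing set, they are estimates of the underlying similarity, and the faithfulness question is how far such estimates can deviate from it.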

Paper Structure

This paper contains 38 sections, 5 theorems, 34 equations, 15 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.1

With probability at least $1-\delta$, where $r(f, D_{\textup{probe}}, \delta)$ describes the convergence rate of similarity function $\hat{\textsf{sim}}(f, c; D_{\textup{probe}})$ and satisfies In eq:genConclu, the confidence parameter $\delta$ is adjusted using a union bound, replacing $\delta$ with $\frac{\delta}{|C|}$.
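The union-bound adjustment mentioned here can be spelled out in one step. The following is only a sketch under assumed notation: $\textsf{sim}(f, c)$ denotes the population similarity that $\hat{\textsf{sim}}(f, c; D_{\textup{probe}})$ estimates, the concept set $C$ is finite, and for each fixed concept $c$ the deviation exceeds $r(f, D_{\textup{probe}}, \delta')$ with probability at most $\delta'$. None of these forms is shown in this excerpt, so the exact statement in the paper may differ. Setting $\delta' = \frac{\delta}{|C|}$ gives

$$
\Pr\Big[\exists\, c \in C:\ \big|\hat{\textsf{sim}}(f, c; D_{\textup{probe}}) - \textsf{sim}(f, c)\big| > r\big(f, D_{\textup{probe}}, \tfrac{\delta}{|C|}\big)\Big]
\;\le\; \sum_{c \in C} \frac{\delta}{|C|} \;=\; \delta .
$$

So, under these assumptions, the bound holds simultaneously for all concepts in $C$ with probability at least $1-\delta$, at the cost of a looser per-concept rate.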

Figures (15)

  • Figure 1: Analogy between neuron identification and machine learning. Neuron identification searches for a concept that matches a neuron, while machine learning searches for a model that matches human labels. Neuron identification can therefore be viewed as the inverse of the learning process.
  • Figure 2: 95% quantile of the error for five similarity metrics under two settings: (a) balanced concept frequency; (b) low concept frequency (0.001). Accuracy converges fastest in both settings.
  • Figure 3: Theoretical and simulation results on generalization gap.
  • Figure 4: Illustration of the bootstrap ensemble in neuron identification. Multiple probing datasets are generated via bootstrapping. A neuron identification algorithm is then applied to each dataset, and the resulting concepts are aggregated to estimate the probability of each concept.
  • Figure 5: Results of applying bootstrap ensemble to NetDissect and CLIP-Dissect on ResNet-50 neurons. NetDissect shows more stable, concrete concepts. CLIP-Dissect outputs are more diverse and abstract.
  • ...and 10 more figures
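The bootstrap ensemble described in the Figure 4 caption can be sketched as follows: resample the probing set with replacement, run an identification routine on each resample, and aggregate how often each concept is selected. Everything named here is hypothetical. In particular, `identify_by_iou` is a stand-in for a real method such as NetDissect or CLIP-Dissect, and the greedy rule in `prediction_set` (add concepts by frequency until the cumulative frequency reaches $1-\alpha$) is an assumption for illustration. It is not claimed to be the paper's BE construction and carries no coverage guarantee by itself.

```python
import random
from collections import Counter


def identify_by_iou(sample, threshold=0.5):
    """Toy identifier: return the concept with the highest IoU on `sample`.

    `sample` is a list of (activation, {concept_name: bool}) records.
    """
    concepts = list(sample[0][1].keys())

    def iou(concept):
        inter = sum(1 for a, lab in sample if a > threshold and lab[concept])
        union = sum(1 for a, lab in sample if a > threshold or lab[concept])
        return inter / union if union else 0.0

    return max(concepts, key=iou)


def bootstrap_concept_frequencies(probe, identify, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each concept is selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        resample = rng.choices(probe, k=len(probe))  # sample with replacement
        counts[identify(resample)] += 1
    return {concept: k / n_boot for concept, k in counts.items()}


def prediction_set(freqs, alpha=0.1):
    """Assumed greedy rule: add concepts by frequency until mass >= 1 - alpha."""
    chosen, total = [], 0.0
    for concept, p in sorted(freqs.items(), key=lambda kv: -kv[1]):
        chosen.append(concept)
        total += p
        if total >= 1 - alpha:
            break
    return chosen


# Toy probing set with two overlapping concepts (illustrative only).
probe = [
    (0.9, {"dog": True, "fur": True}),
    (0.8, {"dog": True, "fur": False}),
    (0.7, {"dog": False, "fur": True}),
    (0.1, {"dog": False, "fur": False}),
    (0.2, {"dog": False, "fur": True}),
    (0.6, {"dog": True, "fur": True}),
]
freqs = bootstrap_concept_frequencies(probe, identify_by_iou, n_boot=500, seed=0)
pred = prediction_set(freqs, alpha=0.1)
```

A neuron whose selected concept changes across resamples yields spread-out frequencies and a larger set, which is one simple way to surface instability with respect to the probing data.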

Theorems & Definitions (11)

  • Theorem 3.1
  • Corollary 3.2
  • Theorem 4.1
  • Definition B.1
  • Proof of \ref{thm:MainGen}
  • Proof of \ref{thm:Cor1}
  • proof
  • Lemma B.2
  • Remark B.3
  • Theorem B.4
  • ...and 1 more