Explaining Neural Networks with Reasons

Levin Hornischer, Hannes Leitgeb

TL;DR

This work introduces a reasons-based interpretability framework for neural networks that treats neurons as epistemic reasons for propositions, yielding a reasons vector for each neuron and a strength metric that quantifies how strongly a neuron supports a given proposition. Grounded in a formal theory of reasons, the method combines logico-symbolic and Bayesian perspectives and applies across architectures using only forward passes. Empirical results cover LeNet on MNIST, robustness and fairness improvements through a doxastic-reason loss, and mechanistic interpretability in LLMs, and they are presented as evidence of faithfulness, correctness, and scalability. The findings suggest that aligning model mechanisms with a principled notion of reasons can enhance robustness, fairness, and transparency while preserving accuracy.

Abstract

We propose a new interpretability method for neural networks, which is based on a novel mathematico-philosophical theory of reasons. Our method computes a vector for each neuron, called its reasons vector. We can then compute how strongly this reasons vector speaks for various propositions, e.g., the proposition that the input image depicts digit 2 or that the input prompt has a negative sentiment. This yields an interpretation of neurons, and groups thereof, that combines a logical and a Bayesian perspective, and accounts for polysemanticity (i.e., that a single neuron can figure in multiple concepts). We show, both theoretically and empirically, that this method is: (1) grounded in a philosophically established notion of explanation, (2) uniform, i.e., applies to the common neural network architectures and modalities, (3) scalable, since computing reasons vectors only involves forward passes in the neural network, (4) faithful, i.e., intervening on a neuron based on its reasons vector leads to expected changes in model output, (5) correct in that the model's reasons structure matches that of the data source, (6) trainable, i.e., neural networks can be trained to improve their reason strengths, and (7) useful, i.e., it delivers on the needs for interpretability by increasing, e.g., robustness and fairness.
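To make the idea of a per-neuron vector and a signed strength concrete, here is a minimal sketch. It is not the paper's definition: the abstract does not state the formulas, so the sketch assumes that a neuron's vector is its row of an activation matrix (neurons by inputs, rescaled to [0, 1]) and uses a simple stand-in score, the mean activation on inputs where the proposition holds minus the mean on inputs where it does not. The names `reasons_vectors` and `strength` are hypothetical.

```python
import numpy as np


def reasons_vectors(activations):
    """Rescale each neuron's row of activations to [0, 1].

    Assumption for illustration only: row i of the (n_neurons, n_inputs)
    activation matrix is treated as neuron i's vector. Collecting the
    matrix needs nothing beyond forward passes over the inputs.
    """
    a = np.asarray(activations, dtype=float)
    lo = a.min(axis=1, keepdims=True)
    span = a.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0  # constant neurons map to all zeros
    return (a - lo) / span


def strength(vector, proposition):
    """Stand-in score (NOT the paper's metric).

    Positive: the neuron is more active on inputs where the proposition
    holds (it "speaks for" it). Negative: it "speaks against" it.
    """
    mask = np.asarray(proposition, dtype=bool)
    if mask.all() or not mask.any():
        raise ValueError("proposition must hold on some inputs and fail on others")
    return float(vector[mask].mean() - vector[~mask].mean())


# Toy data: 3 neurons, 4 inputs; the proposition holds on the last two inputs.
acts = [[0.0, 0.0, 1.0, 1.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.5, 0.5, 0.5, 0.5]]
prop = [False, False, True, True]
vectors = reasons_vectors(acts)
scores = [strength(v, prop) for v in vectors]
```

In this toy setup the first neuron scores positively, the second negatively, and the constant neuron scores zero, which mirrors the for/against reading used in the figures; the actual strength measure in the paper may differ substantially.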

Paper Structure

This paper contains 35 sections, 5 equations, 17 figures.

Figures (17)

  • Figure 1: Left: The triangle of interpretability. Right: The activation matrix.
  • Figure 2: Left: For each neuron in the different layers of LeNet, the strength with which it speaks for (positive) or against (negative) the proposition 'The input depicts digit 3'. The number below the bars indicates the number of neurons in the layer. Right: For each digit $d$ (shown below each bar), the strength with which the output neurons speak for 'The input depicts digit $d$'.
  • Figure 3: Left: intervening on neurons speaking against a digit to flip the prediction away from that digit. Right: intervening on neurons speaking for a digit to flip the prediction to that digit.
  • Figure 4: Left: Worlds that are clustered together (using PCA) because they are internally similar according to the neurons in the hidden linear layer are also externally similar, i.e., have the same label. Right: This is not yet true for neurons in the first convolutional layer.
  • Figure 5: Left: Reason strength of every neuron in the residual stream. Right: Generating output with intervention on the 'positivity' neurons and the 'negativity' neurons, respectively.
  • ...and 12 more figures