Explaining Neural Networks with Reasons
Levin Hornischer, Hannes Leitgeb
TL;DR
This work introduces a reasons-based interpretability framework for neural networks that treats neurons as epistemic reasons for propositions. It yields a reasons vector for each neuron and a strength measure that quantifies how strongly the neuron supports a given proposition. Grounded in a formal theory of reasons, the method combines logico-symbolic and Bayesian perspectives and applies across architectures using only forward passes. The experiments cover LeNet on MNIST, robustness and fairness improvements through a doxastic-reason loss, and mechanistic interpretability in LLMs, and are reported to demonstrate faithfulness, correctness, and scalability. The findings suggest that aligning model mechanisms with a principled notion of reasons can enhance robustness, fairness, and transparency while preserving accuracy.
Abstract
We propose a new interpretability method for neural networks, based on a novel mathematico-philosophical theory of reasons. Our method computes a vector for each neuron, called its reasons vector. We can then compute how strongly this reasons vector speaks for various propositions, e.g., the proposition that the input image depicts the digit 2 or that the input prompt has a negative sentiment. This yields an interpretation of neurons, and groups thereof, that combines a logical and a Bayesian perspective and accounts for polysemanticity (i.e., that a single neuron can figure in multiple concepts). We show, both theoretically and empirically, that this method is: (1) grounded in a philosophically established notion of explanation; (2) uniform, i.e., it applies to the common neural network architectures and modalities; (3) scalable, since computing reasons vectors only involves forward passes through the neural network; (4) faithful, i.e., intervening on a neuron based on its reasons vector leads to the expected changes in model output; (5) correct, in that the model's reasons structure matches that of the data source; (6) trainable, i.e., neural networks can be trained to improve their reason strengths; and (7) useful, i.e., it delivers on the needs for interpretability by increasing, e.g., robustness and fairness.
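The abstract does not spell out how a reasons vector or its strength is computed, so the following is only an illustrative sketch under assumptions of our own, not the authors' definitions. It assumes that a neuron's reasons vector can be approximated by the empirical distribution over class labels among inputs on which the neuron is active, and it substitutes a plain probability-raising score for the strength measure. The function names, the activation threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def reasons_vector(activations, labels, num_classes, threshold=0.5):
    """Assumed stand-in: empirical P(label | neuron active) for one neuron.

    `activations` are the neuron's recorded values from forward passes,
    `labels` the class labels of the corresponding inputs.
    """
    active_labels = [y for a, y in zip(activations, labels) if a > threshold]
    if not active_labels:
        raise ValueError("neuron is never active on this sample")
    counts = Counter(active_labels)
    return [counts[c] / len(active_labels) for c in range(num_classes)]


def strength(vector, prior, proposition):
    """Assumed stand-in: probability raising, P(A | active) - P(A).

    `proposition` is a set of class labels, e.g. {2} for "the image shows a 2".
    """
    return sum(vector[c] - prior[c] for c in proposition)


# Toy data (hypothetical): eight inputs, three classes, one neuron.
labels = [0, 1, 2, 2, 2, 1, 0, 2]
activations = [0.0, 0.1, 0.9, 0.8, 0.7, 0.0, 0.0, 0.0]

label_counts = Counter(labels)
prior = [label_counts[c] / len(labels) for c in range(3)]

vec = reasons_vector(activations, labels, num_classes=3)
s = strength(vec, prior, {2})
print(vec, s)
```

In this toy sample the neuron fires only on inputs labelled 2, so its score for the proposition {2} is positive and its score for {0, 1} is negative. Only recorded activations are needed, which matches the abstract's point that the computation relies on forward passes alone.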
