Explaining Bayesian Neural Networks
Kirill Bykov, Marina M.-C. Höhne, Adelaida Creosteanu, Klaus-Robert Müller, Frederick Klauschen, Shinichi Nakajima, Marius Kloft
TL;DR
This paper addresses a gap in Explainable AI for Bayesian Neural Networks (BNNs): it treats a local explanation as a distribution induced by the posterior over weights, $p(W|\mathcal{D}_{tr})$, and samples multiple explanation maps to quantify the uncertainty of an explanation. It introduces a method-agnostic framework (UAI) that computes mean explanations and constructs Union/Intersection aggregations, together with an uncertainty-aware variant, UAI$^+$, and a clustering step that reveals multi-modal explanation strategies. A key theoretical result shows that, for linear attribution operators, the explanation of the predictive mean equals the mean of the explanations, so the average behavior can be summarized efficiently. Empirical results on CMNIST, ImageNet, and a pathology use case indicate that incorporating explanation uncertainty improves interpretability, highlights diverse reasoning modes, and helps detect spurious cues (Clever Hans effects). The work is limited by the chosen posterior approximation and by the need for dedicated metrics for explanation distributions. Overall, the framework provides a practical, scalable path to uncertainty-aware XAI, with potential impact on safety-critical deployment and more nuanced model debugging in real-world tasks.
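The linearity result can be written compactly as follows. The notation is assumed here for illustration and is not taken from the paper: $f_W$ denotes the network with weights $W$, and $\mathcal{T}$ denotes an attribution operator that is linear in the function it explains.

$$\mathcal{T}\big[\mathbb{E}_{W \sim p(W|\mathcal{D}_{tr})}[f_W]\big](x) = \mathbb{E}_{W \sim p(W|\mathcal{D}_{tr})}\big[\mathcal{T}[f_W](x)\big]$$

The plain gradient is one such operator: $\nabla_x \mathbb{E}_W[f_W(x)] = \mathbb{E}_W[\nabla_x f_W(x)]$, provided differentiation and expectation can be interchanged. The identity is not guaranteed for attribution methods that are nonlinear in $f_W$.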
Abstract
To advance the transparency of learning machines such as Deep Neural Networks (DNNs), the field of Explainable AI (XAI) was established to provide interpretations of DNNs' predictions. While different explanation techniques exist, a popular approach is given in the form of attribution maps, which illustrate, given a particular data point, the relevant patterns the model has used for making its prediction. Although Bayesian models such as Bayesian Neural Networks (BNNs) have a limited form of transparency built-in through their prior weight distribution, they lack explanations of their predictions for given instances. In this work, we take a step toward combining these two perspectives by examining how local attributions can be extended to BNNs. Within the Bayesian framework, network weights follow a probability distribution; hence, the standard point explanation extends naturally to an explanation distribution. Viewing explanations probabilistically, we aggregate and analyze multiple local attributions drawn from an approximate posterior to explore variability in explanation patterns. The diversity of explanations offers a way to further explore how predictive rationales may vary across posterior samples. Quantitative and qualitative experiments on toy and benchmark data, as well as on a real-world pathology dataset, illustrate that our framework enriches standard explanations with uncertainty information and may support the visualization of explanation stability.
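As a rough illustration of the idea of an explanation distribution, the following sketch samples weights, computes one attribution map per sample, and summarizes the resulting maps. Everything in it is an assumption made for illustration: the tiny tanh network, the Gaussian stand-in for an approximate posterior, the choice of gradient × input as the attribution method, and the helper names `sample_weights`, `grad_times_input`, and `summarize`. The percentile summaries are only a loose stand-in for aggregating over samples and are not the paper's UAI definition.

```python
import numpy as np


def predict(x, W1, w2):
    """Tiny one-hidden-layer network: f_W(x) = w2 . tanh(W1 x)."""
    return w2 @ np.tanh(W1 @ x)


def grad_times_input(x, W1, w2):
    """Gradient x input attribution, computed analytically for the network above."""
    h = np.tanh(W1 @ x)
    grad = W1.T @ (w2 * (1.0 - h ** 2))  # d f_W / d x
    return grad * x


def sample_weights(d, n_samples=200, hidden=8, scale=0.3, seed=0):
    """Draw weights from a hypothetical Gaussian stand-in for an approximate posterior."""
    rng = np.random.default_rng(seed)
    mean_W1 = np.linspace(-1.0, 1.0, hidden * d).reshape(hidden, d)
    mean_w2 = np.linspace(0.5, 1.5, hidden)
    return [
        (mean_W1 + scale * rng.standard_normal((hidden, d)),
         mean_w2 + scale * rng.standard_normal(hidden))
        for _ in range(n_samples)
    ]


def summarize(maps, low=5, high=95):
    """Per-feature summaries of the sampled attribution maps."""
    return {
        "mean": maps.mean(axis=0),                 # average explanation
        "std": maps.std(axis=0),                   # spread across weight samples
        "low": np.percentile(maps, low, axis=0),   # relevance shared by most samples
        "high": np.percentile(maps, high, axis=0), # relevance reached by few samples
    }


# One attribution map per sampled weight setting, for a single input point.
x = np.array([0.5, -1.0, 2.0])
weights = sample_weights(d=x.shape[0])
maps = np.stack([grad_times_input(x, W1, w2) for W1, w2 in weights])
summary = summarize(maps)
```

Because gradient × input is linear in the explained function, `summary["mean"]` coincides, up to numerical error, with the gradient × input attribution of the sample-averaged prediction. The `std`, `low`, and `high` entries carry the additional uncertainty information that a single point explanation cannot provide.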
