Model-agnostic variable importance for predictive uncertainty: an entropy-based approach

Danny Wood; Theodore Papamarkou; Matt Benatan; Richard Allmendinger

TL;DR

This paper addresses the need to explain not only the predictions of probabilistic, uncertainty-aware models but also the uncertainty of those predictions. It introduces model-agnostic adaptations of permutation feature importance (PFI), partial dependence plots (PDPs), and individual conditional expectations (ICEs) that quantify how features affect the likelihood and entropy of the predictive distribution: Likelihood-PFI, Entropy-PFI, and their PDP/ICE counterparts. The authors establish theoretical properties, discuss interpretation, and use synthetic and real-world classification and regression datasets to show how these measures reveal when features share information, when uncertainty arises from extrapolation, and how uncertainty affects model performance. The resulting diagnostics for sources of uncertainty complement existing explainability approaches, and open-source code is provided.

Abstract

In order to trust the predictions of a machine learning algorithm, it is necessary to understand the factors that contribute to those predictions. In the case of probabilistic and uncertainty-aware models, it is necessary to understand not only the reasons for the predictions themselves, but also the reasons for the model's level of confidence in those predictions. In this paper, we show how existing methods in explainability can be extended to uncertainty-aware models and how such extensions can be used to understand the sources of uncertainty in a model's predictive distribution. In particular, by adapting permutation feature importance, partial dependence plots, and individual conditional expectation plots, we demonstrate that novel insights into model behaviour may be obtained and that these methods can be used to measure the impact of features on both the entropy of the predictive distribution and the log-likelihood of the ground truth labels under that distribution. With experiments using both synthetic and real-world data, we demonstrate the utility of these approaches to understand both the sources of uncertainty and their impact on model performance.
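The two permutation-based measures named in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Entropy-PFI is the mean predictive entropy after permuting a feature minus the mean entropy before, and Likelihood-PFI is the corresponding drop in mean log-likelihood of the true labels. The function names, the toy model `toy_predict_proba`, and the data are all hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def permute_column(X, j, rng):
    """Return a copy of X (a list of rows) with column j shuffled across rows."""
    column = [row[j] for row in X]
    rng.shuffle(column)
    return [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]


def mean_entropy(predict_proba, X):
    return sum(entropy(predict_proba(row)) for row in X) / len(X)


def mean_log_likelihood(predict_proba, X, y):
    return sum(math.log(predict_proba(row)[label]) for row, label in zip(X, y)) / len(X)


def entropy_pfi(predict_proba, X, j, n_repeats=20, seed=0):
    """Assumed definition: mean entropy after permuting feature j minus mean entropy before."""
    rng = random.Random(seed)
    base = mean_entropy(predict_proba, X)
    permuted = [mean_entropy(predict_proba, permute_column(X, j, rng))
                for _ in range(n_repeats)]
    return sum(permuted) / n_repeats - base


def likelihood_pfi(predict_proba, X, y, j, n_repeats=20, seed=0):
    """Assumed definition: drop in mean log-likelihood of the labels when feature j is permuted."""
    rng = random.Random(seed)
    base = mean_log_likelihood(predict_proba, X, y)
    permuted = [mean_log_likelihood(predict_proba, permute_column(X, j, rng), y)
                for _ in range(n_repeats)]
    return base - sum(permuted) / n_repeats


# Hypothetical toy model: a fixed logistic classifier that uses feature 0 only.
def toy_predict_proba(row):
    p1 = 1.0 / (1.0 + math.exp(-3.0 * row[0]))
    return [1.0 - p1, p1]


# Hypothetical data: two independent Gaussian features, labels drawn from the toy model.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(500)]
y = [1 if data_rng.random() < toy_predict_proba(row)[1] else 0 for row in X]

for j in range(2):
    print(j, likelihood_pfi(toy_predict_proba, X, y, j), entropy_pfi(toy_predict_proba, X, j))
```

In this toy setup, permuting feature 0 lowers the log-likelihood because it breaks the link between the feature and the labels. The mean entropy is unchanged, though, because the two features are independent and shuffling leaves the input distribution as it was.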

Paper Structure (35 sections, 3 theorems, 32 equations, 15 figures)

Introduction
Background
Uncertainty quantification
Feature importance
Permutation-based feature importance methods
Permutation feature importance (PFI)
Partial dependence plots (PDPs)
Individual conditional expectations (ICEs)
Explaining likelihood and uncertainty
Likelihood-PFI
Entropy-PFI
Properties of Entropy-PFI
Why conditional PFI is not useful in the context of entropy
How to interpret Entropy-PFI
When Entropy-PFI is zero
...and 20 more sections

Key Result

Proposition 1

If $X_{-j}$ is independent of $X_j$, then the Entropy-PFI of feature $j$ is zero.
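
A sketch of why this holds, assuming Entropy-PFI for feature $j$ is the expected predictive entropy under permuted inputs minus the expected predictive entropy under the original inputs. The symbols $\mathrm{EPFI}_j$, $H$, and $\tilde{X}_j$ are introduced here for illustration and may differ from the paper's notation. Permuting feature $j$ replaces the joint input distribution with the product of its marginals, so under independence the two expectations are taken over the same distribution:

```latex
\begin{align*}
\mathrm{EPFI}_j
  &= \mathbb{E}_{\tilde{X}_j \sim p(x_j),\; X_{-j} \sim p(x_{-j})}
       \left[ H\big(p(y \mid \tilde{X}_j, X_{-j})\big) \right]
   - \mathbb{E}_{(X_j, X_{-j}) \sim p(x_j, x_{-j})}
       \left[ H\big(p(y \mid X_j, X_{-j})\big) \right] \\
  &= 0 \qquad \text{whenever } p(x_j, x_{-j}) = p(x_j)\, p(x_{-j}).
\end{align*}
```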

Figures (15)

Figure 1: Example of PDP and ICE plots. The blue line shows the PDP. The ICE curves (gray) show how the output of the model changes for individual examples as the feature of interest changes.
Figure 2: Visualisation of effects of PFI. The colour of each point shows the cluster to which the original (unpermuted) test example belonged. In the left panel, the original test set is shown, along with the contour lines for the entropy of a (hypothetical) model's predictive distribution. In the centre panel, the test set is shown after permuting feature 2. In the right panel, histograms of the entropy before and after permuting the second feature are shown.
Figure 3: Comparison of Likelihood-PFI and Entropy-PFI for three datasets, the second and third of which contain redundant features. When feature 10 is a copy of feature 1 (an informative feature), the Likelihood-PFI of feature 1 drops and its Entropy-PFI increases, while both Likelihood-PFI and Entropy-PFI increase for feature 10. When feature 10 is a copy of feature 5 (an uninformative feature), there is no effect on Likelihood-PFI for either feature, and a small increase in Entropy-PFI for both.
Figure 4: Likelihood-PFI and Entropy-PFI for features in a synthetic dataset using a Gaussian process model. Since features 1-4 share information with each other, their Likelihood-PFI is reduced relative to the independent feature 5. In contrast, their shared information means that they have higher Entropy-PFI, where feature 5's Entropy-PFI is negligible.
Figure 5: Likelihood-PFI and Entropy-PFI for datasets varying the amount of noise in the target variable. The datasets are the same as in Figure \ref{fig:regression_pfi_toy}, but with the variance of $\epsilon$ set to the value $\sigma^2$ in each case. We see that when reducing the amount of noise, both the Likelihood-PFI and Entropy-PFI increase.
...and 10 more figures
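
The PDP and ICE curves described in the Figure 1 caption can be computed for predictive entropy in the same way as for a model's output. The following is a minimal sketch under assumptions, not the authors' code: each ICE curve is the entropy of one example's prediction as the feature of interest sweeps over a grid, and the PDP is the pointwise mean of those curves. The names `entropy_ice`, `entropy_pdp`, and `toy_predict_proba` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_ice(predict_proba, X, j, grid):
    """One curve per example: predictive entropy as feature j is set to each grid value."""
    return [[entropy(predict_proba(row[:j] + [v] + row[j + 1:])) for v in grid]
            for row in X]


def entropy_pdp(predict_proba, X, j, grid):
    """Pointwise average of the ICE curves over all examples."""
    curves = entropy_ice(predict_proba, X, j, grid)
    return [sum(curve[k] for curve in curves) / len(curves) for k in range(len(grid))]


# Hypothetical toy model: logistic classifier on the sum of two features.
def toy_predict_proba(row):
    p1 = 1.0 / (1.0 + math.exp(-2.0 * (row[0] + row[1])))
    return [1.0 - p1, p1]


X = [[0.0, -1.0], [0.0, 0.0], [0.0, 1.0]]
grid = [-1.0, 0.0, 1.0]

ice = entropy_ice(toy_predict_proba, X, 0, grid)
pdp = entropy_pdp(toy_predict_proba, X, 0, grid)
print(ice)
print(pdp)
```

Each ICE curve here peaks where its example crosses the toy model's decision boundary. Averaging the curves into a PDP hides that the peak sits at a different grid value for each example.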

Theorems & Definitions (6)

Proposition 1
proof
Proposition 2
proof
Proposition 3
proof
