
No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions

Emily Schiller, Teodor Chiaburu, Marco Zullich, Luca Longo

Abstract

Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent as studies rely on heterogeneous proxy tasks and metrics, hindering comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for systematic comparison and development of uncertainty attribution methods.

Paper Structure

This paper contains 27 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Uncertainty attribution method by Bley et al. bley2025explaining. Predictive variance is estimated via ensemble-based UQ. Each of $K$ predictions is explained, and uncertainty attributions are derived as diagonal elements of the covariance matrix of these explanations. Illustration adapted from Bley et al. bley2025explaining.
  • Figure 2: Uncertainty and feature attributions using Monte-Carlo dropout (MCD) and (a) LRP and (b) Integrated Gradients for test samples from the MNIST dataset. Dark blue pixels indicate low importance and dark red pixels indicate high importance to predictive uncertainty (uncertainty attribution) or the prediction (feature attribution).
  • Figure 3: Overview of the proposed evaluation framework. We combine a feature attribution method with an uncertainty quantification (UQ) method (blue panel) to generate uncertainty attributions. We evaluate these uncertainty attributions by adapting and extending properties (orange panel) and metrics (red panel) from feature attribution to the uncertainty setting; novel categories and metrics appear in italics. Metric results are aggregated into an overall score for each uncertainty attribution method.
  • Figure 4: Experimental setup. Using 5-fold cross-validation, we train a CNN for MNIST and an MLP for Wine Quality. We compute uncertainty attributions for 100 test samples per fold using combinations of UQ and feature attribution methods (blue panel), rate them using our evaluation framework, and assess metrics via sanity checks.
  • Figure 5: Metric scores for the Wine Quality dataset with 95% confidence intervals. Arrows indicate whether lower ($\downarrow$) or higher ($\uparrow$) values are better. Gradient-based methods (IxG, IG) and LRP outperform perturbation-based methods on most metrics for MCD and Monte-Carlo dropconnect (MCDC). MCDC shows better repeatability and feature-flipping scores and higher UCS than MCD.
  • ...and 1 more figure
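The construction described in the Figure 1 caption, deriving uncertainty attributions as the diagonal of the covariance matrix of the $K$ per-member explanations, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and the toy inputs are hypothetical, and the $K$ explanation vectors are assumed to have been computed already (e.g. one feature attribution per ensemble member or per Monte-Carlo dropout sample).

```python
import numpy as np

def uncertainty_attributions(explanations: np.ndarray) -> np.ndarray:
    """Derive uncertainty attributions from K per-member explanations.

    explanations: array of shape (K, D), one feature-attribution vector
    per ensemble member (or MC dropout / dropconnect forward pass).
    Returns a length-D vector: the diagonal of the covariance matrix of
    the explanations, i.e. the per-feature variance across members.
    """
    # Covariance across the K explanations; rowvar=False treats the
    # columns (features) as the random variables.
    cov = np.cov(explanations, rowvar=False)
    return np.diag(cov)

# Toy usage: K = 5 sampled explanations over D = 3 features.
expl = np.array([
    [0.1, 0.5, -0.2],
    [0.2, 0.4, -0.1],
    [0.0, 0.6, -0.3],
    [0.1, 0.5, -0.2],
    [0.3, 0.5, -0.2],
])
ua = uncertainty_attributions(expl)  # high values = features the
                                     # members disagree on
```

Features whose attributions vary strongly across the $K$ members receive large uncertainty attributions, which matches the intuition in the caption: disagreement among the ensemble's explanations is attributed back to individual input features.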

Theorems & Definitions (1)

  • Definition 1: Uncertainty conveyance similarity (UCS)