A Unified Evaluation Framework for Epistemic Predictions

Shireen Kudukkil Manchingal, Muhammad Mubashar, Kaizheng Wang, Fabio Cuzzolin

TL;DR

The paper addresses the problem of comparing uncertainty-aware classifiers whose epistemic predictions take different forms, and introduces a unified evaluation framework for them. It maps all prediction types to credal sets within the probability simplex and defines a metric $\mathcal{E} = d(y,\hat{y}) + \lambda \cdot \mathrm{NS}[m]$ that combines a distance $d$ between the ground truth $y$ and the prediction $\hat{y}$ with the non-specificity $\mathrm{NS}[m]$ of the prediction, weighted by a trade-off parameter $\lambda$; $d$ is instantiated as the KL divergence $D_{KL}$ to the credal-set boundary. The approach enables cross-model ranking and application-specific model selection, and is validated on CIFAR-10, CIFAR-100, and MNIST across multiple uncertainty paradigms; it also provides a practical credal-set construction via coherent lower probabilities and Möbius inversion. The framework supports application-driven decisions (e.g., abstention vs. mandatory action) and discusses trade-offs, limitations, and future directions such as training-time losses that optimize the proposed metric.
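As a rough illustration of how a metric of the form $\mathcal{E} = d(y,\hat{y}) + \lambda \cdot NS[m]$ could be computed, the sketch below makes several assumptions that are not confirmed by this summary: the credal set is given by a finite list of vertices, the distance is taken as the smallest KL divergence from a one-hot ground truth to any point of the credal set, and non-specificity is the standard Dubois-Prade form $\sum_A m(A)\log|A|$. The function names (`kl_to_credal_set`, `non_specificity`, `evaluation_metric`) and the toy predictions are hypothetical.

```python
import math

def kl_to_credal_set(true_class, vertices, eps=1e-12):
    """Smallest KL divergence from a one-hot target to a credal set.

    For a one-hot target e_k, KL(e_k || p) = -log p_k. Since p_k is linear
    in p, its maximum over the convex hull of `vertices` is attained at a
    vertex, so only the vertices need to be inspected.
    """
    upper = max(v[true_class] for v in vertices)
    return -math.log(max(upper, eps))  # eps guards against log(0)

def non_specificity(mass):
    """Dubois-Prade non-specificity (assumed form): sum of m(A) * log|A|."""
    return sum(m * math.log(len(A)) for A, m in mass.items() if m > 0)

def evaluation_metric(true_class, vertices, mass, lam=1.0):
    """Distance term plus lam times the imprecision term."""
    return kl_to_credal_set(true_class, vertices) + lam * non_specificity(mass)

# Toy example with three classes {0, 1, 2}; the true class is 1.
# A precise (point) prediction: a single vertex, all mass on singletons.
point_vertices = [(0.1, 0.8, 0.1)]
point_mass = {frozenset({0}): 0.1, frozenset({1}): 0.8, frozenset({2}): 0.1}

# An imprecise prediction: mass 0.4 on the whole label set. The vertices
# come from moving that 0.4 onto each class in turn.
credal_vertices = [(0.4, 0.6, 0.0), (0.0, 1.0, 0.0), (0.0, 0.6, 0.4)]
credal_mass = {frozenset({1}): 0.6, frozenset({0, 1, 2}): 0.4}

for lam in (0.1, 1.0):
    e_point = evaluation_metric(1, point_vertices, point_mass, lam)
    e_credal = evaluation_metric(1, credal_vertices, credal_mass, lam)
    print(f"lambda={lam}: point={e_point:.4f}, credal={e_credal:.4f}")
```

In this toy case the imprecise prediction contains the ground truth, so its distance term is zero and it scores better for a small $\lambda$; for a larger $\lambda$ the penalty on imprecision makes the point prediction preferable. This is the kind of trade-off the parameter is meant to expose.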

Abstract

Predictions of uncertainty-aware models are diverse, ranging from single point estimates (often averaged over prediction samples) to predictive distributions, to set-valued or credal-set representations. We propose a novel unified evaluation framework for uncertainty-aware classifiers, applicable to a wide range of model classes, which allows users to tailor the trade-off between accuracy and precision of predictions via a suitably designed performance metric. This makes possible the selection of the most suitable model for a particular real-world application as a function of the desired trade-off. Our experiments, concerning Bayesian, ensemble, evidential, deterministic, credal and belief function classifiers on the CIFAR-10, MNIST and CIFAR-100 datasets, show that the metric behaves as desired.

Paper Structure

This paper contains 34 sections, 20 equations, 55 figures, 10 tables, 1 algorithm.

Figures (55)

  • Figure 1: Different types of uncertainty-aware model predictions, shown in a unit simplex of probability distributions defined on the list of classes $\mathbf{Y}= \{a, b, c\}$. Our proposed evaluation framework uses a metric which combines, for each input $\mathbf{x}$, a distance (arrows) between the corresponding ground truth (e.g., $(0,1,0)$) and the epistemic predictions generated by the various models (in the form of credal sets), and a measure of the extent of the credal prediction (non-specificity).
  • Figure 2: KL divergence (top left), non-specificity (top right), and evaluation metric (bottom left) for correctly classified (CC) and incorrectly classified (ICC) samples, and evaluation metric vs. trade-off parameter (bottom right), for all models on the CIFAR-10 dataset.
  • Figure 3: Visualizations of 100 prediction samples obtained prior to Bayesian Model Averaging and corresponding Bayesian Model Averaged prediction in two real scenarios from CIFAR-10.
  • Figure 4: Visualizations of belief and mass predictions on the power-set space and their mapping to the label space $\mathbf{Y}$ using pignistic probabilities on the CIFAR-10 dataset.
  • Figure 5: Probability simplices illustrating the convex closure of predictions and credal sets for the Bayesian model (LB-BNN) across three classes of the CIFAR-10 dataset.
  • ...and 50 more figures