Calibrated and uncertain? Evaluating uncertainty estimates in binary classification models

Aurora Grefsrud, Nello Blaser, Trygve Buanes

TL;DR

This study combines the unifying framework of approximate Bayesian inference with empirical tests on carefully constructed synthetic classification datasets to investigate qualitative properties of six probabilistic machine learning algorithms for class probability and uncertainty estimation. It finds that none of the deep-learning-based algorithms provides uncertainties that consistently reflect the lack of experimental evidence for out-of-distribution data points.

Abstract

Rigorous statistical methods, including parameter estimation with accompanying uncertainties, underpin the validity of scientific discovery, especially in the natural sciences. With increasingly complex data models such as deep learning techniques, uncertainty quantification has become exceedingly difficult and a plethora of techniques have been proposed. In this case study, we use the unifying framework of approximate Bayesian inference combined with empirical tests on carefully created synthetic classification datasets to investigate qualitative properties of six different probabilistic machine learning algorithms for class probability and uncertainty estimation: (i) a neural network ensemble, (ii) a neural network ensemble with conflictual loss, (iii) evidential deep learning, (iv) a single neural network with Monte Carlo Dropout, (v) Gaussian process classification and (vi) a Dirichlet process mixture model. We check whether the algorithms produce uncertainty estimates that reflect commonly desired properties, such as being well calibrated and exhibiting an increase in uncertainty for out-of-distribution data points. Our results indicate that all algorithms show reasonably good calibration performance on our synthetic test sets, but none of the deep-learning-based algorithms provides uncertainties that consistently reflect lack of experimental evidence for out-of-distribution data points. We hope our study may serve as a clarifying example for researchers who are using or developing methods of uncertainty estimation for scientific data-driven modeling and analysis.

Paper Structure

This paper contains 28 sections, 25 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: The conditional and marginal generating distributions for the two datasets. The upper row shows the distribution from which we sampled dataset A, with parameters $\alpha_{1}, \eta_{1} = [5, 2]$, $\alpha_{2}, \eta_{2} = [3, 6]$. The lower row shows the distribution from which we sampled dataset B, with parameters $\alpha_{1}, \eta_{1} = [3, 2]$ and $\alpha_{2}, \eta_{2} = [3, 4]$. The panels from left to right show the conditional distributions $P(c^j=1|r, \alpha_1, \alpha_2, \eta_1, \eta_2, P_c)=\nu^j_\infty(r)$, $p(r|c, \alpha_j, \eta_j)$, and the marginal distributions $p(r|\alpha_1, \alpha_2, \eta_1, \eta_2, P_c)$ and $P(c)$.
  • Figure 2: Subsets of training datasets A (left) and B (right) with $N_\text{train} = 250$, $5000$ and $10000$ data points. The upper row shows the data points of class 1 in red and the lower row shows the data points of class 2 in blue.
  • Figure 3: Estimated probabilities (top row) and uncertainties (bottom row) for class 1 for the different algorithms for dataset A as a function of radius $|\mathbf{x}|$. The error bars indicate the entire spread of the data over polar angle $\phi$, while the markers indicate the sample average. The long-run frequency distribution (solid black line) is plotted for reference.
  • Figure 4: Estimated probabilities (top row) and uncertainties (bottom row) for class 1 for the different algorithms for dataset B as a function of radius $|\mathbf{x}|$. The error bars indicate the entire spread of the data over polar angle $\phi$. The long-run frequency distribution (solid black line) is plotted for reference.
  • Figure 5: Calibration metrics of the test set as a function of number of training data points $N_\text{train}$. The six subplots show the scores of the different models: neural network ensemble (NNE, blue line), neural network ensemble with conflictual loss (CL, orange line), neural network using evidential deep learning (EDL, green line), neural network with Monte Carlo Dropout (MCD, red line), Gaussian Process classification (GP, purple line) and a Dirichlet Process Mixture Model (DPMM, brown line). The metrics calculated are the accuracy (ACC), estimated calibration error (ECE), cross entropy loss (LogLoss), model calibration error (Z), Wasserstein-1 distance (WD) and Kullback-Leibler divergence (KL-div). The dashed black line in the top left plot indicates the optimal accuracy of the test set.
  • ...and 11 more figures
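
Two of the calibration metrics named in the Figure 5 caption, the binned calibration error (ECE) and the cross entropy loss (LogLoss), can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of 10 equal-width bins, and the decision to bin on the predicted probability of class 1 (rather than on the confidence of the predicted class) are assumptions made here for illustration, and the paper's exact definitions may differ.

```python
import math


def binned_calibration_error(probs, labels, n_bins=10):
    """Binned calibration error for binary predictions (illustrative sketch).

    probs  : predicted probabilities of class 1, each in [0, 1]
    labels : observed outcomes, 1 for class 1 and 0 otherwise
    n_bins : number of equal-width probability bins (an assumption)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by the fraction of points it holds.
        ece += len(members) / n * abs(mean_prob - frequency)
    return ece


def log_loss(probs, labels, eps=1e-12):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


if __name__ == "__main__":
    # Toy inputs chosen for illustration only; they are not data from the paper.
    toy_probs = [0.25, 0.25, 0.25, 0.25]
    toy_labels = [1, 0, 0, 0]
    print(binned_calibration_error(toy_probs, toy_labels))
    print(log_loss(toy_probs, toy_labels))
```

In the toy input, a predicted probability of 0.25 matches the observed class-1 frequency of one in four, so the binned calibration error vanishes even though the individual predictions are far from certain. Calibration and confidence are separate properties, which is why the caption reports both calibration and loss metrics.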