Why Machine Learning Models Fail to Fully Capture Epistemic Uncertainty

Sebastián Jiménez, Mira Jürgens, Willem Waegeman

TL;DR

This paper tackles the mismatch between common second-order uncertainty methods and the full epistemic uncertainty in ML models by introducing a fine-grained taxonomy and a simulation-based evaluation framework using a reference distribution that accounts for data and procedural randomness. It provides a regression-specific bias-variance decomposition within this framework and demonstrates that high model bias can cause underestimation of epistemic uncertainty, with bias often being misattributed to aleatoric uncertainty by many methods. Through synthetic experiments and a real NYC taxi dataset, the authors show that typical approaches (e.g., Deep Ensembles) predominantly capture procedural uncertainty and fail to represent data-driven epistemic components, leading to distorted uncertainty partitions. The work highlights the need for task-aware evaluation protocols and full representation of all epistemic sources to obtain reliable and interpretable uncertainty estimates for downstream tasks like active learning and out-of-distribution detection.

Abstract

In recent years various supervised learning methods that disentangle aleatoric and epistemic uncertainty based on second-order distributions have been proposed. We argue that these methods fail to capture critical components of epistemic uncertainty, particularly due to the often-neglected component of model bias. To show this, we make use of a more fine-grained taxonomy of epistemic uncertainty sources in machine learning models, and analyse how the classical bias-variance decomposition of the expected prediction error can be decomposed into different parts reflecting these uncertainties. By using a simulation-based evaluation protocol which encompasses epistemic uncertainty due to both procedural- and data-driven uncertainty components, we illustrate that current methods rarely capture the full spectrum of epistemic uncertainty. Through theoretical insights and synthetic experiments, we show that high model bias can lead to misleadingly low estimates of epistemic uncertainty, and common second-order uncertainty quantification methods systematically blur bias-induced errors into aleatoric estimates, thereby underrepresenting epistemic uncertainty. Our findings underscore that meaningful aleatoric estimates are feasible only if all relevant sources of epistemic uncertainty are properly represented.
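The decomposition the abstract refers to can be sketched as follows. This is the textbook form for squared-error regression, not a reproduction of the paper's own equation, and all symbols are assumptions introduced here: $f^*(\boldsymbol{x}) = \mathbb{E}[y \mid \boldsymbol{x}]$ is the true conditional mean, $\sigma^2(\boldsymbol{x})$ the noise variance, $\hat{f}_{D,\gamma}$ the model fitted on training set $D$ with procedural randomness $\gamma$ (e.g. the initialisation seed), and $\bar{f}(\boldsymbol{x}) = \mathbb{E}_{D,\gamma}[\hat{f}_{D,\gamma}(\boldsymbol{x})]$. The noise in $y$ is assumed independent of $D$ and $\gamma$.

```latex
\begin{align}
\mathbb{E}_{D,\gamma,y}\!\left[\big(y - \hat{f}_{D,\gamma}(\boldsymbol{x})\big)^2\right]
  &= \underbrace{\sigma^2(\boldsymbol{x})}_{\text{aleatoric}}
   + \underbrace{\big(\bar{f}(\boldsymbol{x}) - f^*(\boldsymbol{x})\big)^2}_{\text{bias}^2}
   + \underbrace{\operatorname{Var}_{D,\gamma}\!\big[\hat{f}_{D,\gamma}(\boldsymbol{x})\big]}_{\text{variance}}, \\
\operatorname{Var}_{D,\gamma}\!\big[\hat{f}_{D,\gamma}(\boldsymbol{x})\big]
  &= \underbrace{\mathbb{E}_{D}\!\Big[\operatorname{Var}_{\gamma}\!\big[\hat{f}_{D,\gamma}(\boldsymbol{x}) \mid D\big]\Big]}_{\text{procedural}}
   + \underbrace{\operatorname{Var}_{D}\!\Big[\mathbb{E}_{\gamma}\!\big[\hat{f}_{D,\gamma}(\boldsymbol{x}) \mid D\big]\Big]}_{\text{data}}.
\end{align}
```

The second line is the law of total variance. Labelling its two terms "procedural" and "data" is a reading of the abstract rather than a quotation. The bias term appears in neither variance component, which is consistent with the claim that variance-based estimates can be misleadingly low when model bias is high.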

Paper Structure

This paper contains 19 sections, 17 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: (a): Synthetic experiment where a high model bias is falsely attributed to aleatoric uncertainty: Due to limited data points on the left side of the $x$-axis, and high complexity of the generating function, the underlying epistemic uncertainty there is high. The heteroscedastic Gaussian process model (falsely) predicts it as high aleatoric uncertainty. See Sec. \ref{sec: what part of epistemic uncertainty} for details. (b): Sources of epistemic uncertainty in machine learning, where $p(y|\boldsymbol{x})$ denotes the true data-generating process. Different sources of epistemic uncertainty can lead to the estimate $\hat{p}(y|\boldsymbol{x})$ being further from the ground truth. See Sec. \ref{sec: epistemic uncertainty definition} for further details. Figure adapted from \cite{huang2023efficient}.
  • Figure 2: Estimated procedural and data uncertainty of the reference distribution for increasing sample sizes. The reference distribution is estimated by resampling training data and procedural parameters, using $n_{\gamma}=10$ and $n_d=20$. Procedural and data uncertainty are calculated using the decomposition formula in Eqn. \ref{eq: decomposition of variance}. Even when the model predictions (solid line) deviate substantially from the ground-truth expected value (dashed line), i.e. when the bias is large, the epistemic uncertainty estimate decreases with more data.
  • Figure 3: Deep Ensemble estimates, obtained with different sample sizes $N\in \{50,100,500\}$. The top row shows the training data points (blue crosses), the true conditional mean (dashed line), the model predictions (solid line), the estimated $95\%$ confidence interval for aleatoric uncertainty (purple), and the true $95\%$ confidence interval of the aleatoric uncertainty (gray). The bottom row shows the estimate of epistemic uncertainty (green).
  • Figure 4: (a): Procedural and data uncertainty estimates from the reference distribution of the taxi trip duration dataset. The procedural uncertainty component is larger than the data uncertainty component. (b): Comparison between the epistemic uncertainty estimated by the Deep Ensemble and the epistemic uncertainty obtained from the reference distribution. The Deep Ensemble underestimates the epistemic uncertainty $80\%$ of the time. (c): Relation between the epistemic uncertainty estimated by the Deep Ensemble and the procedural uncertainty coming from the reference distribution. On average the Deep Ensemble's estimate captures the procedural uncertainty part of the total epistemic uncertainty.
  • Figure 5: Comparison of the estimated aleatoric uncertainty with the true (simulated) aleatoric noise across regions characterized by different levels of model bias and inherent data noise. The notation $\sigma_i$ for $i \in \{1, 2, 3, 4\}$ denotes the true aleatoric uncertainty for each of the four regions: (1) low bias and low true noise, (2) high bias and low true noise, (3) low bias and high true noise, and (4) high bias and high true noise. For the synthetic dataset, the division between high- and low-noise regions follows Eqn. \ref{eq: sine data problem}: points with $x < 0.6$ are considered low-noise, while those with $x \geq 0.6$ are considered high-noise. Within each noise region, samples are ranked according to their model bias, and the lower half of points (by bias magnitude) defines the low-bias region, while the upper half defines the high-bias region. For the taxi dataset, the high- and low-noise regions are determined using the 90th percentile of the simulated true aleatoric uncertainty, as described in Section \ref{sec: real data experiments}. Within each noise region, data points are further divided into low- and high-bias subsets based on the 70th percentile of the estimated model bias. This construction allows for a direct comparison between the estimated and true aleatoric uncertainty across systematically defined regions, highlighting how model bias affects the reliability of aleatoric uncertainty estimates in both synthetic and real-world scenarios.
  • ...and 10 more figures
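
The resampling scheme described in the Figure 2 caption can be illustrated with a minimal sketch. Only the counts $n_d = 20$ and $n_{\gamma} = 10$ come from the caption; the noisy-sine data generator, the random-feature ridge model, and every function name below are hypothetical stand-ins chosen so the example runs with NumPy alone, and are not the paper's setup. The split into procedural and data variance uses the law of total variance, which is assumed (not confirmed) to match the paper's decomposition formula.

```python
import numpy as np

def sample_dataset(rng, n):
    # Hypothetical data-generating process (not the paper's): a noisy sine.
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)
    return x, y

def fit_predict(x, y, x_test, seed, n_features=20, ridge=1e-3):
    # Procedural randomness: the random Fourier features depend only on `seed`.
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 5.0, size=n_features)
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)

    def phi(t):
        return np.cos(np.outer(t, w) + b)

    A = phi(x)
    # Ridge regression on the random features.
    coef = np.linalg.solve(A.T @ A + ridge * np.eye(n_features), A.T @ y)
    return phi(x_test) @ coef

def reference_decomposition(n_train, n_d=20, n_gamma=10, seed=0):
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_d, n_gamma, x_test.size))
    for i in range(n_d):              # resample the training data
        x, y = sample_dataset(rng, n_train)
        for j in range(n_gamma):      # resample the procedural seed
            preds[i, j] = fit_predict(x, y, x_test, seed=1000 * i + j)
    # Law of total variance, evaluated pointwise on the test grid:
    procedural = preds.var(axis=1).mean(axis=0)   # E_D[Var_gamma(f | D)]
    data = preds.mean(axis=1).var(axis=0)         # Var_D(E_gamma[f | D])
    total = preds.reshape(-1, x_test.size).var(axis=0)
    return procedural, data, total
```

With population variances (NumPy's default `ddof=0`) and equal group sizes, `total` equals `procedural + data` up to floating-point error. Neither component measures the distance between the mean prediction and the true conditional mean, so a biased model can still show small values for both.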

Theorems & Definitions (3)

  • Definition 1
  • Example 1: Deep Ensembles purely capture procedural uncertainty.
  • Example 2: DER may fail to provide faithful epistemic uncertainty estimates.
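
For context on Example 1, the split commonly used with Deep Ensembles can be sketched as below. It assumes each ensemble member outputs a Gaussian mean and variance per input, and it uses the standard mixture-moment split (mean of member variances as aleatoric, variance of member means as epistemic). Whether the paper uses exactly this estimator is an assumption, and the function name is hypothetical.

```python
import numpy as np

def ensemble_uncertainty(means, variances):
    """Split ensemble predictions into aleatoric and epistemic parts.

    means, variances: arrays of shape (n_members, n_points), one row per
    ensemble member (hypothetical layout chosen for this sketch).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    aleatoric = variances.mean(axis=0)   # average predicted noise variance
    epistemic = means.var(axis=0)        # disagreement between member means
    return aleatoric, epistemic, aleatoric + epistemic
```

Because all members are trained on the same dataset and differ only in procedural choices such as initialisation, the disagreement term is a variance over procedural randomness conditional on that one dataset, which is in line with the title of Example 1.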