Sources of Uncertainty in Supervised Machine Learning -- A Statisticians' View

Cornelia Gruber; Patrick Oliver Schenk; Malte Schierholz; Frauke Kreuter; Göran Kauermann

TL;DR

This paper argues that uncertainty in supervised machine learning cannot be adequately captured by a simple aleatoric/epistemic dichotomy. It formalizes both definitions from a statistical perspective, connects them to classical concepts such as the bias-variance trade-off, and highlights multiple data-related sources of uncertainty, including omitted variables, measurement and label errors, and non-i.i.d. data. It introduces the Total Survey Error framework and discusses deployment-related challenges such as distribution shift and transportability, emphasizing that data quality and the data production process are central to reliable predictions. The work advocates a data-centric view of uncertainty, identifies when conventional assumptions fail, and outlines future directions for principled uncertainty quantification in practice.

Abstract

Supervised machine learning and predictive models have achieved an impressive standard today, enabling us to answer questions that were inconceivable a few years ago. Besides these successes, it has become clear that, beyond pure prediction, which is the primary strength of most supervised machine learning algorithms, the quantification of uncertainty is relevant and necessary as well. However, before quantification is possible, types and sources of uncertainty need to be defined precisely. While first concepts and ideas in this direction have emerged in recent years, this paper adopts a conceptual, basic science perspective and examines possible sources of uncertainty. By adopting the viewpoint of a statistician, we discuss the concepts of aleatoric and epistemic uncertainty, which are more commonly associated with machine learning. The paper aims to formalize the two types of uncertainty and demonstrates that sources of uncertainty are diverse and cannot always be decomposed into aleatoric and epistemic components. Drawing parallels between statistical concepts and uncertainty in machine learning, we emphasize the role of data and their influence on uncertainty.

Paper Structure (49 sections, 59 equations, 4 figures, 2 tables)

Introduction
Related Work and Applications
Aleatoric and Epistemic Uncertainty in Supervised Machine Learning
Why Considering Uncertainties Helps
Sources of Uncertainty
Aleatoric and Epistemic Uncertainty
Aleatoric and Epistemic Uncertainty in Classical Statistics
Statistical Model
Aleatoric and Epistemic Uncertainty in the Bias-Variance Decomposition
Kullback-Leibler Divergence and Misspecified Models
The Role of Data - Model Uncertainty Revisited
General Comments
Uncertainty Due to Unobserved Variables
General Framework
Omitted Variables
...and 34 more sections

Figures (4)

Figure 1: Linear model example. The solid line shows the true relationship between $x$ and $y$. The estimated regression line is dashed. The 90% prediction interval is shown in grey. Even though the true model class or hypothesis space is known, i.e., a simple linear model (no model uncertainty), it is not possible to predict $y$ precisely (aleatoric uncertainty). Since the regression line is estimated with finite data, there is a discrepancy between the true parameters $\theta$ and the estimate $\hat{\theta}$ (estimation uncertainty).
Figure 2: Errors in $X$ setting
Figure 3: Errors in $Y$ setting
Figure 4: Left: Total Survey Error Framework Components (groves_survey_2009). Right: Total Data Quality Framework in Machine Learning (west.wagner.2023.tdq; "Dimensions of TDQ" is licensed under https://creativecommons.org/licenses/by-nc/4.0/).
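
The setting described in the Figure 1 caption can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the true parameters, noise level, sample sizes, and the helper names `fit_and_predict` and `simulate` are all hypothetical, and the interval uses a normal quantile as an approximation to the exact t quantile. The sketch fits a simple linear model by ordinary least squares and splits the predictive variance at a new point into an estimated noise term (the aleatoric part) and the variance of the fitted mean (the estimation part).

```python
import numpy as np
from statistics import NormalDist


def fit_and_predict(x, y, x_new, level=0.90):
    """OLS fit of y = b0 + b1 * x and an approximate prediction interval at x_new."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])        # design matrix with intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ y               # estimated parameters
    resid = y - X @ theta_hat
    sigma2_hat = resid @ resid / (n - 2)        # noise variance estimate (aleatoric part)
    x0 = np.array([1.0, x_new])
    est_var = sigma2_hat * (x0 @ XtX_inv @ x0)  # variance of the fitted mean (estimation part)
    # Normal quantile used as an approximation to the t quantile (assumption).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    y_hat = x0 @ theta_hat
    half_width = z * np.sqrt(sigma2_hat + est_var)
    return theta_hat, sigma2_hat, est_var, (y_hat - half_width, y_hat + half_width)


def simulate(n, seed=0):
    """Draw n points from an assumed true model y = 1 + 2x + noise and fit it."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=n)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)  # hypothetical true parameters
    return fit_and_predict(x, y, x_new=5.0)


theta_hat, sigma2_hat, est_var, interval = simulate(50)
print("estimated parameters:", theta_hat)
print("noise variance estimate:", sigma2_hat)
print("estimation variance at x = 5:", est_var)
print("approximate 90% prediction interval:", interval)
```

Calling `simulate` with a larger `n` shrinks the estimation term towards zero while the noise term stays near its true value, which mirrors the caption's point that more data reduces estimation uncertainty but not aleatoric uncertainty.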
