Sources of Uncertainty in Supervised Machine Learning -- A Statisticians' View

Cornelia Gruber, Patrick Oliver Schenk, Malte Schierholz, Frauke Kreuter, Göran Kauermann

TL;DR

This paper argues that uncertainty in supervised machine learning cannot be adequately captured by a simple aleatoric/epistemic dichotomy. It formalizes both types of uncertainty from a statistical perspective, connects them to classical concepts such as the bias-variance trade-off, and highlights several data-related sources of uncertainty, including omitted variables, measurement and label errors, and non-i.i.d. data. It introduces the Total Survey Error framework and discusses deployment-related challenges such as distribution shift and transportability, emphasizing that data quality and the data production process are central to reliable predictions. The work advocates a data-centric view of uncertainty, identifies when conventional assumptions fail, and outlines future directions for principled uncertainty quantification in practice.

Abstract

Supervised machine learning and predictive models have achieved an impressive standard today, enabling us to answer questions that were inconceivable a few years ago. Alongside these successes, it has become clear that, beyond pure prediction, which is the primary strength of most supervised machine learning algorithms, the quantification of uncertainty is relevant and necessary as well. However, before quantification is possible, types and sources of uncertainty need to be defined precisely. While first concepts and ideas in this direction have emerged in recent years, this paper adopts a conceptual, basic science perspective and examines possible sources of uncertainty. By adopting the viewpoint of a statistician, we discuss the concepts of aleatoric and epistemic uncertainty, which are more commonly associated with machine learning. The paper aims to formalize the two types of uncertainty and demonstrates that sources of uncertainty are varied and cannot always be decomposed into aleatoric and epistemic. Drawing parallels between statistical concepts and uncertainty in machine learning, we emphasise the role of data and their influence on uncertainty.

Paper Structure (49 sections, 59 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Linear model example. The solid line shows the true relationship between $x$ and $y$; the dashed line is the estimated regression line, and the 90% prediction interval is shown in grey. Even though the true model class (hypothesis space) is known to be a simple linear model, so there is no model uncertainty, $y$ cannot be predicted precisely (aleatoric uncertainty). Because the regression line is estimated from finite data, the estimate $\hat{\theta}$ differs from the true parameters $\theta$ (estimation uncertainty).
  • Figure 2: Errors in $X$ setting
  • Figure 3: Errors in $Y$ setting
  • Figure 4: Left: components of the Total Survey Error framework [groves_survey_2009]. Right: Total Data Quality framework in machine learning ("Dimensions of TDQ" by [west.wagner.2023.tdq], licensed under https://creativecommons.org/licenses/by-nc/4.0/).
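
The setting described in the Figure 1 caption can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: the true parameters, noise level, sample size, and the helper name `fit_and_interval` are all assumptions chosen for illustration, and the interval uses a normal quantile as an approximation (a t quantile with n - 2 degrees of freedom would be the textbook choice under Gaussian errors). It splits the predictive variance at each new point into an estimated noise variance (the aleatoric part) and the variance of the fitted line (the estimation part).

```python
import numpy as np
from statistics import NormalDist


def fit_and_interval(x, y, x_new, level=0.90):
    """Fit y = b0 + b1 * x by least squares and return an approximate
    prediction interval at x_new (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n = x.size
    x_bar = x.mean()
    s_xx = np.sum((x - x_bar) ** 2)

    # Closed-form least-squares estimates of slope and intercept.
    b1 = np.sum((x - x_bar) * (y - y.mean())) / s_xx
    b0 = y.mean() - b1 * x_bar

    # Estimated noise variance: the part that remains even with a perfect fit.
    resid = y - (b0 + b1 * x)
    noise_var = np.sum(resid ** 2) / (n - 2)

    # Variance of the fitted line at x_new: shrinks as n grows.
    est_var = noise_var * (1.0 / n + (x_new - x_bar) ** 2 / s_xx)

    # Normal quantile as an approximation to the t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    y_hat = b0 + b1 * x_new
    half_width = z * np.sqrt(noise_var + est_var)
    return {
        "b0": b0,
        "b1": b1,
        "y_hat": y_hat,
        "lower": y_hat - half_width,
        "upper": y_hat + half_width,
        "noise_var": noise_var,
        "est_var": est_var,
    }


# Assumed data-generating process: y = 1 + 2 x + Gaussian noise with sd 1.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

grid = np.linspace(0.0, 10.0, 11)
res = fit_and_interval(x, y, grid)

# Empirical coverage of the 90% interval on fresh draws from the same process.
x_test = rng.uniform(0.0, 10.0, size=5000)
y_test = 1.0 + 2.0 * x_test + rng.normal(0.0, 1.0, size=5000)
res_test = fit_and_interval(x, y, x_test)
coverage = np.mean((y_test >= res_test["lower"]) & (y_test <= res_test["upper"]))

print("estimated intercept and slope:", res["b0"], res["b1"])
print("estimated noise variance:", res["noise_var"])
print("largest estimation variance on grid:", res["est_var"].max())
print("empirical coverage:", coverage)
```

In this sketch the estimation variance is smallest near the mean of the observed $x$ values and shrinks as the sample grows, while the noise variance does not shrink with more data. This mirrors the caption's distinction between estimation uncertainty and aleatoric uncertainty when the model class is known.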