Sources of Uncertainty in Supervised Machine Learning -- A Statisticians' View
Cornelia Gruber, Patrick Oliver Schenk, Malte Schierholz, Frauke Kreuter, Göran Kauermann
TL;DR
This paper argues that uncertainty in supervised machine learning cannot be adequately captured by a simple aleatoric/epistemic dichotomy. It formalizes definitions from a statistical perspective, connects them to classical concepts like the bias-variance trade-off, and highlights multiple data-related sources of uncertainty, including omitted variables, measurement and label errors, and non-i.i.d. data. It introduces the Total Survey Error framework and discusses deployment-related challenges such as distribution shift and transportability, emphasizing that data quality and the data production process are central to reliable predictions. The work advocates a data-centric view of uncertainty, identifies when conventional assumptions fail, and outlines future directions for principled uncertainty quantification in practice.
Abstract
Supervised machine learning and predictive models have achieved an impressive standard today, enabling us to answer questions that were inconceivable a few years ago. Alongside these successes, it has become clear that, beyond pure prediction, which is the primary strength of most supervised machine learning algorithms, the quantification of uncertainty is relevant and necessary as well. However, before quantification is possible, types and sources of uncertainty need to be defined precisely. While first concepts and ideas in this direction have emerged in recent years, this paper adopts a conceptual, basic science perspective and examines possible sources of uncertainty. By adopting the viewpoint of a statistician, we discuss the concepts of aleatoric and epistemic uncertainty, which are more commonly associated with machine learning. The paper aims to formalize the two types of uncertainty and demonstrates that sources of uncertainty are manifold and cannot always be decomposed into aleatoric and epistemic components. Drawing parallels between statistical concepts and uncertainty in machine learning, we emphasise the role of data and their influence on uncertainty.
