A metrological framework for uncertainty evaluation in machine learning classification models
Samuel Bilson, Maurice Cox, Anna Pustogvar, Andrew Thompson
TL;DR
The paper develops a metrological framework for evaluating uncertainty in ML classification outputs by treating the classifier's nominal prediction as a probability mass function over classes. It argues that PMFs, not traditional error metrics alone, are essential for expressing and propagating uncertainty through multistage measurement models, enabling traceability and rigorous uncertainty handling in domains like climate observation and medical diagnostics. The framework analyzes a range of uncertainty statistics (e.g., WVR, UVR, SDM, entropy, IQV, CNV) and demonstrates their behavior on two case studies: land cover classification with Bayesian generative modelling and atrial fibrillation detection with a CNN using Monte Carlo Dropout. Findings indicate entropy is the most sensitive to PMF changes, while UVR provides robust uncertainty assessment; for binary tasks several statistics coincide, simplifying interpretation. The work highlights opportunities to extend the GUM for nominal properties and to better integrate metrology with uncertainty-aware ML pipelines in practice.
Abstract
Machine learning (ML) classification models are increasingly being used in a wide range of applications where it is important that predictions are accompanied by uncertainties, including in climate and earth observation, medical diagnosis and bioaerosol monitoring. The output of an ML classification model is a type of categorical variable known as a nominal property in the International Vocabulary of Metrology (VIM). However, concepts related to uncertainty evaluation for nominal properties are not defined in the VIM, nor is such evaluation addressed by the Guide to the Expression of Uncertainty in Measurement (GUM). In this paper we propose a conceptual metrological framework for uncertainty evaluation of nominal properties. This framework is based on probability mass functions and summary statistics thereof, and it is applicable to ML classification. We also illustrate its use in the context of two applications that exemplify the issues and have significant societal impact, namely, climate and earth observation and medical diagnosis. Our framework would enable an extension of the VIM and GUM to uncertainty for nominal properties, which would make both applicable to ML classification models.
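To make the idea of summarising a probability mass function (PMF) with a single uncertainty statistic concrete, the following is a minimal sketch of two of the statistics named above, entropy and the index of qualitative variation (IQV). It uses common textbook definitions, which are an assumption here: the paper's own normalisations may differ. The function names and the example PMFs are hypothetical and are not taken from the case studies.

```python
import math


def validate_pmf(pmf, tol=1e-9):
    """Check that the values are non-negative and sum to one."""
    if any(p < 0.0 for p in pmf) or abs(sum(pmf) - 1.0) > tol:
        raise ValueError("not a valid probability mass function")


def shannon_entropy(pmf, normalise=True):
    """Shannon entropy of a PMF over K classes.

    With normalise=True the value is divided by log(K), so that it lies
    in [0, 1] (an assumed convention, chosen for comparability).
    """
    validate_pmf(pmf)
    k = len(pmf)
    # Terms with p == 0 contribute nothing, by the usual convention.
    h = -sum(p * math.log(p) for p in pmf if p > 0.0)
    if normalise and k > 1:
        h /= math.log(k)
    return h


def iqv(pmf):
    """Index of qualitative variation: K / (K - 1) * (1 - sum of p_i squared)."""
    validate_pmf(pmf)
    k = len(pmf)
    if k < 2:
        return 0.0
    return k / (k - 1) * (1.0 - sum(p * p for p in pmf))


# Hypothetical classifier outputs over three classes.
confident = [0.90, 0.05, 0.05]
ambiguous = [0.40, 0.35, 0.25]

for name, pmf in [("confident", confident), ("ambiguous", ambiguous)]:
    print(name, round(shannon_entropy(pmf), 3), round(iqv(pmf), 3))
```

Under these definitions both statistics are 0 when all probability sits on one class and 1 when the PMF is uniform, so they can be compared on a common scale.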
