New Statistical Framework for Extreme Error Probability in High-Stakes Domains for Reliable Machine Learning
Umberto Michelucci, Francesca Venturini
TL;DR
This work introduces a framework based on Extreme Value Theory (EVT) for quantifying extreme prediction errors in high-stakes machine learning, addressing the limitations of validation metrics that report only averages. It combines EVT with Monte Carlo cross-validation to estimate tail risk in two ways: block maxima, modeled with the Generalized Extreme Value (GEV) distribution, and threshold exceedances, modeled with the Generalized Pareto Distribution (GPD). The framework is demonstrated on synthetic data and two real healthcare-related datasets. The resulting tail-risk bounds (e.g., 95% worst-case errors) often exceed what MAE or MSE suggest, which highlights the importance of tail-aware validation for safe AI deployment. The framework provides actionable tail-risk metrics and a practical workflow. The paper also discusses limitations (stationarity, threshold choice, computational cost) and directions for future work: extending EVT to classification and building more robust, scalable tooling.
Abstract
Machine learning is vital in high-stakes domains, yet conventional validation methods rely on averaged metrics such as mean squared error (MSE) or mean absolute error (MAE), which do not quantify extreme errors. Worst-case prediction failures can have substantial consequences, but current frameworks lack statistical foundations for assessing their probability. This work presents a new statistical framework, based on Extreme Value Theory (EVT), that provides a rigorous approach to estimating worst-case failures. Applied to synthetic and real-world datasets, the method is shown to enable robust estimation of catastrophic failure probabilities, overcoming the fundamental limitations of standard cross-validation. This work establishes EVT as a fundamental tool for assessing model reliability and for safer AI deployment in new technologies where uncertainty quantification is central to decision-making or scientific analysis.
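As an illustration of the threshold-exceedance idea, the sketch below pools absolute test errors from Monte Carlo cross-validation, fits a GPD to the errors above a high threshold, and extrapolates a tail quantile. It is a minimal sketch under stated assumptions, not the authors' implementation: the synthetic data, the one-variable least-squares model, the 90% threshold, and the method-of-moments GPD estimator (used here only to stay within the Python standard library, where a maximum-likelihood fit would be more usual) are all illustrative choices, and every function name is hypothetical. Errors pooled across overlapping splits are not independent, which the sketch ignores.

```python
import math
import random
import statistics


def make_data(n=500, seed=1):
    """Hypothetical synthetic data: linear signal plus occasionally large noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # About 5% of points get noise with three times the standard deviation.
    ys = [2.0 * x + rng.gauss(0.0, 3.0 if rng.random() < 0.05 else 1.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def monte_carlo_cv_errors(xs, ys, n_splits=200, test_frac=0.2, seed=0):
    """Pool absolute test errors over repeated random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_test = max(1, int(test_frac * len(xs)))
    errors = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.extend(abs(ys[i] - (a + b * xs[i])) for i in test)
    return errors


def fit_gpd_moments(exceedances):
    """Method-of-moments GPD estimates (shape xi, scale sigma).

    Uses mean = sigma / (1 - xi) and var = mean^2 / (1 - 2 xi), so xi < 1/2.
    """
    m = statistics.fmean(exceedances)
    v = statistics.variance(exceedances)
    xi = 0.5 * (1.0 - m * m / v)
    sigma = 0.5 * m * (m * m / v + 1.0)
    return xi, sigma


def tail_quantile(errors, p=0.99, threshold_q=0.90):
    """Estimate the p-quantile of the error from a GPD fit above a threshold u.

    Solves P(error > x) = zeta * (1 + xi * (x - u) / sigma) ** (-1 / xi) = 1 - p,
    which requires p to lie above the threshold level.
    """
    s = sorted(errors)
    u = s[int(threshold_q * len(s))]
    exceedances = [e - u for e in s if e > u]
    xi, sigma = fit_gpd_moments(exceedances)
    zeta = len(exceedances) / len(s)  # empirical P(error > u)
    ratio = (1.0 - p) / zeta
    if abs(xi) < 1e-9:  # exponential-tail limit of the GPD
        q = u - sigma * math.log(ratio)
    else:
        q = u + sigma / xi * (ratio ** (-xi) - 1.0)
    return q, xi, sigma, u


xs, ys = make_data()
errors = monte_carlo_cv_errors(xs, ys)
q99, xi, sigma, u = tail_quantile(errors)
print(f"MAE = {statistics.fmean(errors):.3f}")
print(f"threshold u = {u:.3f}, xi = {xi:.3f}, sigma = {sigma:.3f}")
print(f"estimated 99% error quantile = {q99:.3f}")
```

Comparing the printed MAE with the estimated 99% quantile shows the gap between an average-error summary and a tail-aware one. The block-maxima route would instead take the largest error in each block of splits and fit a GEV distribution to those maxima.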
