New Statistical Framework for Extreme Error Probability in High-Stakes Domains for Reliable Machine Learning

Umberto Michelucci, Francesca Venturini

TL;DR

This work introduces a rigorous EVT-based framework to quantify extreme prediction errors in high-stakes machine learning, addressing the limitations of average-only validation metrics. By integrating EVT with Monte Carlo cross-validation, it enables estimation of tail risks through blocking (GEV) and threshold-exceedance (GPD) approaches, demonstrated on synthetic data and two real healthcare-related datasets. The results reveal meaningful tail-risk bounds (e.g., 95% worst-case errors) that often exceed what MAE/MSE suggest, highlighting the importance of tail-aware validation for safe AI deployment. The framework offers actionable tail-risk metrics and a practical workflow, with discussion of limitations (stationarity, threshold choice, computational cost) and directions for extending EVT to classification and more robust, scalable tooling.

Abstract

Machine learning is vital in high-stakes domains, yet conventional validation methods rely on averaging metrics like mean squared error (MSE) or mean absolute error (MAE), which fail to quantify extreme errors. Worst-case prediction failures can have substantial consequences, but current frameworks lack statistical foundations for assessing their probability. In this work a new statistical framework, based on Extreme Value Theory (EVT), is presented that provides a rigorous approach to estimating worst-case failures. Applying EVT to synthetic and real-world datasets, this method is shown to enable robust estimation of catastrophic failure probabilities, overcoming the fundamental limitations of standard cross-validation. This work establishes EVT as a fundamental tool for assessing model reliability, ensuring safer AI deployment in new technologies where uncertainty quantification is central to decision-making or scientific analysis.
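The blocking approach summarized in the TL;DR can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The data model, the linear fit, the split sizes, and the helper name `one_split` are all assumptions made for the example. Each repetition here draws fresh synthetic data, which keeps the per-split maxima independent; Monte Carlo cross-validation on a fixed dataset would instead resample splits of the same data. Note that SciPy's `genextreme` uses the shape parameter `c = -xi`, the opposite sign to the usual EVT convention.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)

def one_split(rng, n_train=320, n_test=80):
    """One repetition: fit a line on a random training set and return
    (MAE, largest absolute error) on the held-out test set."""
    n = n_train + n_test
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 3.0 * x + rng.standard_t(df=5, size=n)  # hypothetical data model
    idx = rng.permutation(n)
    train, test = idx[:n_train], idx[n_train:]
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    errors = np.abs(slope * x[test] + intercept - y[test])
    return errors.mean(), errors.max()

# Each repetition is one "block"; its maximum error is one block maximum.
results = np.array([one_split(rng) for _ in range(300)])
mae_values, block_maxima = results[:, 0], results[:, 1]

# Fit the GEV family to the block maxima by maximum likelihood.
c, loc, scale = genextreme.fit(block_maxima)
xi = -c  # convert SciPy's shape parameter to the EVT sign convention

# Level that the per-split worst error stays below with probability 0.95.
level_95 = genextreme.ppf(0.95, c, loc=loc, scale=scale)

print(f"average MAE over splits: {mae_values.mean():.2f}")
print(f"GEV fit: xi={xi:.2f}, mu={loc:.2f}, sigma={scale:.2f}")
print(f"95% level of the per-split worst error: {level_95:.2f}")
```

Comparing the average MAE with the fitted 95% level shows the gap between a typical error and a plausible worst case that an averaged metric alone does not reveal.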

Paper Structure

This paper contains 15 sections, 2 theorems, 10 equations, 8 figures, 3 tables.

Key Result

Theorem 1

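Let $M_n$ denote the maximum of $n$ independent, identically distributed random variables. If there exist sequences of constants $a_n>0$ and $b_n$ such that $\Pr\{(M_n-b_n)/a_n \le z\} \to G(z)$ as $n\to\infty$ for a non-degenerate distribution $G$, then $G$ is a member of the Generalised Extreme Value (GEV) family $G(z)=\exp\{-[1+\xi(z-\mu)/\sigma]^{-1/\xi}\}$, defined on $\{z: 1+\xi(z-\mu)/\sigma>0\}$, where $-\infty<\mu<\infty$, $\sigma>0$ and $-\infty < \xi < \infty$.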
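To make the GEV family concrete, the sketch below evaluates the standard GEV distribution function $G(z)=\exp\{-[1+\xi(z-\mu)/\sigma]^{-1/\xi}\}$ and its inverse using only the standard library. The function names `gev_cdf` and `gev_quantile` are hypothetical, and the case $\xi=0$ is treated as the Gumbel limit $G(z)=\exp\{-\exp[-(z-\mu)/\sigma]\}$.

```python
import math

def gev_cdf(z, mu, sigma, xi):
    """GEV distribution function with location mu, scale sigma, shape xi."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    s = (z - mu) / sigma
    if abs(xi) < 1e-12:
        exponent = -s  # Gumbel limit: G = exp(-exp(-s))
    else:
        t = 1.0 + xi * s
        if t <= 0:
            # Outside the support {z: 1 + xi (z - mu) / sigma > 0}:
            # below the lower endpoint (xi > 0) or above the upper one (xi < 0).
            return 0.0 if xi > 0 else 1.0
        exponent = -math.log(t) / xi  # log of t ** (-1 / xi)
    # Cap the exponent so math.exp cannot overflow; the result is then 0.0.
    return math.exp(-math.exp(min(exponent, 700.0)))

def gev_quantile(q, mu, sigma, xi):
    """Inverse of gev_cdf for 0 < q < 1."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(-math.log(q))
    return mu + sigma / xi * ((-math.log(q)) ** (-xi) - 1.0)

# Example with arbitrary illustrative parameters (not taken from the paper).
z95 = gev_quantile(0.95, mu=10.0, sigma=2.0, xi=-0.2)
print(z95, gev_cdf(z95, mu=10.0, sigma=2.0, xi=-0.2))
```

The quantile at $q=0.95$ is the level that a block maximum stays below with probability 0.95, which is one way to read a "95% worst-case error".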

Figures (8)

  • Figure 1: Process workflow for applying EVT to machine learning with synthetic data. Panel (A) shows the blocking method, while Panel (B) illustrates the threshold-based approach.
  • Figure 2: Distribution of $G^j_n$ (panel (A)) and $M_n^j$ (panel (B)). The blue vertical line indicates the average of the metric: MAE in panel (A) and MSE in panel (B). Note that although the symbol $\mu$ is commonly associated with averages, here it does not represent the mean of the values.
  • Figure 3: Return plots to assess the goodness of the fits of the data to the GEV distribution families.
  • Figure 4: The analysis conducted using datasets $B_3$ and $B_4$, as outlined in the text, reveals the distribution of $\epsilon$ values exceeding the threshold of 15, depicted as light yellow bars. The red line represents the fitted generalized Pareto distribution, characterized by parameters $\xi=-0.43$, $u=15$, and $\sigma=3.57$.
  • Figure 5: The figure highlights the variability between average and extreme errors across models. SVR stands out by reducing both average and extreme errors, which makes it a suitable candidate for real-world applications. Overfitting is evident in models like the decision tree and random forest, where training errors are minimal but test errors remain high. The visualization underscores the importance of analyzing extreme values, using methods like GEV or generalized Pareto distributions, to better understand worst-case scenarios.
  • ...and 3 more figures
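
As a worked illustration of the threshold-based approach, the sketch below plugs the generalized Pareto parameters quoted in the Figure 4 caption ($\xi=-0.43$, $u=15$, $\sigma=3.57$) into the standard GPD tail formula $\Pr\{\epsilon > x \mid \epsilon > u\} = [1+\xi(x-u)/\sigma]^{-1/\xi}$. The function name `gpd_survival` and the query point $x=20$ are assumptions chosen for the example. The result is conditional on exceeding the threshold; an unconditional probability would also need the fraction of errors above $u$, which the caption does not give.

```python
import math

def gpd_survival(x, u, sigma, xi):
    """P(error > x | error > u) under a generalized Pareto tail."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if x <= u:
        return 1.0
    y = x - u
    if abs(xi) < 1e-12:
        return math.exp(-y / sigma)  # exponential limit as xi -> 0
    t = 1.0 + xi * y / sigma
    if t <= 0:
        return 0.0  # beyond the finite upper endpoint (only when xi < 0)
    return t ** (-1.0 / xi)

# Parameters as quoted in the Figure 4 caption.
xi, u, sigma = -0.43, 15.0, 3.57

# A negative shape parameter implies a finite upper bound on the error.
upper_endpoint = u - sigma / xi
p_above_20 = gpd_survival(20.0, u, sigma, xi)

print(f"implied upper bound on the error: {upper_endpoint:.2f}")
print(f"P(error > 20 | error > 15): {p_above_20:.3f}")
```

With these parameters the fitted tail is bounded, so under this fit errors above roughly 23.3 have probability zero; that bound is a property of the fitted model, not a guarantee about future data.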

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2