Negative impact of heavy-tailed uncertainty and error distributions on the reliability of calibration statistics for machine learning regression tasks
Pascal Pernot
TL;DR
This work analyzes how heavy-tailed uncertainty and error distributions undermine the reliability of average calibration statistics for regression in ML-UQ. It contrasts two approaches, the relative calibration error ($RCE$) and the mean squared z-scores ($ZMS$), showing that $RCE$ is highly sensitive to tails and outliers while $ZMS$ is more robust but not infallible. Using synthetic experiments and real ML-UQ datasets, it demonstrates that heavy tails are widespread in uncertainty and error distributions, which leads to unreliable bootstrap confidence intervals (CIs) for mean-based metrics and to frequent disagreements between calibration diagnostics. The paper proposes screening datasets for tailedness with robust metrics, cautions against overreliance on mean-based calibration statistics, and suggests alternative uncertainty representations or adjusted formulations to improve calibration validation in practice.
Abstract
Average calibration of the (variance-based) prediction uncertainties of machine learning regression tasks can be tested in two ways: one is to estimate the calibration error (CE) as the difference between the mean squared error (MSE) and the mean variance (MV); the alternative is to compare the mean squared z-scores (ZMS) to 1. The problem is that the two approaches might lead to different conclusions, as illustrated in this study for an ensemble of datasets from the recent machine learning uncertainty quantification (ML-UQ) literature. It is shown that the estimation of MV, MSE and their confidence intervals becomes unreliable for heavy-tailed uncertainty and error distributions, which seem to be a frequent feature of ML-UQ datasets. By contrast, the ZMS statistic is less sensitive and offers the most reliable approach in this context, with the caveat that datasets with heavy-tailed z-score distributions should still be considered with great care. Unfortunately, the same problem is expected to affect conditional calibration statistics, such as the popular ENCE, and very likely post-hoc calibration methods based on similar statistics. Several solutions to circumvent the outlined problems are proposed.
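As an illustration of the two tests described above, the following minimal Python sketch computes MSE, MV, CE and ZMS from arrays of errors and uncertainties. The function name `calibration_stats`, the sign convention CE = MSE - MV, and the synthetic calibrated dataset are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def calibration_stats(errors, uncertainties):
    """Return (MSE, MV, CE, ZMS) for errors E_i and standard uncertainties u_i.

    Hypothetical helper for illustration only.
    """
    e = np.asarray(errors, dtype=float)
    u = np.asarray(uncertainties, dtype=float)
    mse = np.mean(e**2)          # mean squared error
    mv = np.mean(u**2)           # mean variance (mean of squared uncertainties)
    ce = mse - mv                # calibration error; sign convention assumed here
    zms = np.mean((e / u) ** 2)  # mean squared z-scores, to be compared to 1
    return mse, mv, ce, zms


# Synthetic, calibrated-by-construction data (an assumption for this sketch):
# each error is drawn from a normal distribution whose standard deviation
# equals the stated uncertainty.
rng = np.random.default_rng(0)
n = 10_000
u = rng.uniform(0.5, 2.0, size=n)
e = u * rng.standard_normal(n)

mse, mv, ce, zms = calibration_stats(e, u)
print(f"MSE={mse:.3f}  MV={mv:.3f}  CE={ce:.3f}  ZMS={zms:.3f}")
```

For such a well-behaved dataset, CE should be close to 0 and ZMS close to 1. The abstract's point is that, when the uncertainty or error distributions are heavy-tailed, a few extreme values can dominate the means entering MSE and MV, so the two tests may no longer agree.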
