Negative impact of heavy-tailed uncertainty and error distributions on the reliability of calibration statistics for machine learning regression tasks

Pascal Pernot

TL;DR

This work analyzes how heavy-tailed uncertainty and error distributions undermine the reliability of average calibration statistics for regression in ML-UQ. It contrasts the relative calibration error ($RCE$) with the mean squared z-score ($ZMS$), showing that $RCE$ is highly sensitive to tails and outliers, while $ZMS$ is more robust but not infallible. Through synthetic experiments and real ML-UQ datasets, it demonstrates that heavy tails are widespread in uncertainty and error distributions, which leads to unreliable bootstrap CIs for mean-based metrics and to frequent disagreements between calibration diagnostics. The paper proposes screening datasets for tailedness with robust metrics, cautions against overreliance on mean-based calibration statistics, and suggests alternative uncertainty representations or adjusted formulations to improve calibration validation in practice.

Abstract

Average calibration of the (variance-based) prediction uncertainties of machine learning regression tasks can be tested in two ways: one is to estimate the calibration error (CE) as the difference between the mean squared error (MSE) and the mean variance (MV); the alternative is to compare the mean squared z-scores (ZMS) to 1. The problem is that the two approaches might lead to different conclusions, as illustrated in this study for an ensemble of datasets from the recent machine learning uncertainty quantification (ML-UQ) literature. It is shown that the estimation of MV, MSE and their confidence intervals becomes unreliable for heavy-tailed uncertainty and error distributions, which seem to be a frequent feature of ML-UQ datasets. By contrast, the ZMS statistic is less sensitive and offers the most reliable approach in this context, while acknowledging that datasets with heavy-tailed z-score distributions should be considered with great care. Unfortunately, the same problem is expected to also affect conditional calibration statistics, such as the popular ENCE, and very likely post-hoc calibration methods based on similar statistics. Several solutions to circumvent the outlined problems are proposed.
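
As a minimal sketch of the two tests described in the abstract, the Python snippet below computes CE = MSE - MV and ZMS from arrays of errors and uncertainties, and attaches a simple percentile bootstrap interval to either statistic. The function names, the use of NumPy, and the percentile bootstrap (chosen here for brevity; the paper may rely on a different bootstrap variant) are assumptions made for illustration, not the author's implementation.

```python
import numpy as np


def calibration_stats(errors, uncertainties):
    """Average calibration statistics for variance-based uncertainties.

    errors        : prediction errors E_i
    uncertainties : predicted standard uncertainties u_i (must be > 0)
    """
    e = np.asarray(errors, dtype=float)
    u = np.asarray(uncertainties, dtype=float)
    mse = np.mean(e**2)          # mean squared error
    mv = np.mean(u**2)           # mean variance
    zms = np.mean((e / u) ** 2)  # mean squared z-score
    return {"MSE": mse, "MV": mv, "CE": mse - mv, "ZMS": zms}


def bootstrap_ci(errors, uncertainties, stat="ZMS",
                 n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for one statistic (illustrative choice)."""
    rng = np.random.default_rng(seed)
    e = np.asarray(errors, dtype=float)
    u = np.asarray(uncertainties, dtype=float)
    n = e.size
    values = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample (E_i, u_i) pairs together
        values[b] = calibration_stats(e[idx], u[idx])[stat]
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(values, [alpha, 1.0 - alpha])
    return float(lo), float(hi)


if __name__ == "__main__":
    # Hypothetical calibrated dataset: E_i = u_i * z_i with z_i ~ N(0, 1).
    rng = np.random.default_rng(42)
    u = rng.uniform(0.5, 2.0, size=5000)
    e = u * rng.normal(size=5000)
    print(calibration_stats(e, u))
    print("ZMS CI:", bootstrap_ci(e, u, "ZMS"))  # calibration: does it contain 1?
    print("CE  CI:", bootstrap_ci(e, u, "CE"))   # calibration: does it contain 0?
```

Calibration would be accepted when the CE interval contains 0 or the ZMS interval contains 1. Because CE and ZMS weight the data differently, the two checks can disagree on the same dataset, which is the situation the abstract points to.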

Paper Structure (25 sections, 21 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Fit of the squared uncertainties (histogram) by an Inverse-Gamma $\Gamma^{-1}(\nu,\nu)$ distribution (blue line).
  • Figure 2: Fit of the squared errors (histogram) by a Fisher-Snedecor $F(1,\nu)$ distribution (blue line). The red curves represent the distributions for NIG models compatible with Fig. \ref{fig:fituE2}.
  • Figure 3: $\beta_{GM}$ skewness and $\kappa_{CS}$ kurtosis values for samples (size $5\times10^{5}$) issued from Fisher-Snedecor $F(1,\nu)$ distributions (blue dots) and from Inverse-Gamma $\Gamma^{-1}(\nu/2,\nu/2)$ distributions (red triangles). The dashed horizontal line represents the limit for a squared normal variate.
  • Figure 4: Comparison of the estimated values of RCE and 1-ZMS for a series of datasets generated by TIG models and characterized by their error skewness parameter $\beta_{GM}(E^{2})$, or $\beta_{GM}(Z^{2})$ for 1-ZMS. The symbols and 2$\sigma$ error bars summarize a sample of $10^{3}$ Monte Carlo runs.
  • Figure 5: Validation probability of the ZMS and RCE statistics for calibrated datasets generated by two scenarios: (left) NIG with $\nu_{IG}$ as parameter of the IG distribution; (right) TIG with $\nu_{D}$ as parameter of the generative $D=t_{s}$ distribution. The corresponding average values of $\beta_{GM}$ are reported on the upper axis.
  • ...and 6 more figures
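
The captions above refer to $\beta_{GM}$ and $\kappa_{CS}$ without defining them. Assuming they denote the Groeneveld-Meeden skewness, (mean - median) / mean absolute deviation from the median, and the Crow-Siddiqui kurtosis, $(q_{0.975}-q_{0.025})/(q_{0.75}-q_{0.25})$, a minimal Python sketch of such a tailedness screening could look as follows. The function names and the sample generation are hypothetical and only illustrate the idea.

```python
import numpy as np


def beta_gm(x):
    """Robust skewness, assumed Groeneveld-Meeden form; lies in [-1, 1].

    Undefined (division by zero) if all values of x are identical.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    return (np.mean(x) - med) / np.mean(np.abs(x - med))


def kappa_cs(x):
    """Robust kurtosis, assumed Crow-Siddiqui form (ratio of quantile ranges)."""
    q025, q25, q75, q975 = np.quantile(
        np.asarray(x, dtype=float), [0.025, 0.25, 0.75, 0.975]
    )
    return (q975 - q025) / (q75 - q25)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 100_000
    # Squared normal variate versus squared Student-t variate (heavier tail).
    z2_normal = rng.normal(size=n) ** 2
    z2_heavy = rng.standard_t(df=3, size=n) ** 2
    for name, sample in [("normal^2", z2_normal), ("t(3)^2", z2_heavy)]:
        print(name, "beta_GM =", beta_gm(sample), "kappa_CS =", kappa_cs(sample))
```

Both statistics rely on the median and on quantiles rather than on higher moments, so they remain usable on heavy-tailed samples for which moment-based skewness and kurtosis may not be reliably estimated.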