On the good reliability of an interval-based metric to validate prediction uncertainty for machine learning regression tasks

Pascal Pernot

TL;DR

The paper addresses the challenge of validating prediction-uncertainty calibration in regression when uncertainty and error distributions exhibit heavy tails. It proposes an interval-based approach using Prediction Interval Coverage Probability (PICP), leveraging the near-constant enlargement factor $k_{95}$ for unit-variance $Z\sim t_s(\nu)$ and the empirical robustness of 95% coverage for $\nu>3$, to provide a faster, more reliable calibration test than variance-based metrics like ZMS. Applying PICP to Jacobs et al.'s 33 datasets shows 10 untestable cases due to heavy tails, with 18 datasets validated and 5 invalid, while local coverage analysis (LCP) confirms overall uniformity of coverage for the validated sets. The approach offers a scalable, practical framework for uncertainty calibration validation in large regression datasets, with potential improvements from active learning to mitigate tail issues.
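The near-constancy of $k_{95}$ can be checked numerically. The sketch below is a minimal standard-library illustration, not the paper's code. It assumes that $t_s(\nu)$ denotes a Student's $t$ variable rescaled by $\sqrt{(\nu-2)/\nu}$ to unit variance (hence $\nu>2$). The function names `t_pdf`, `coverage` and `k95` are hypothetical. The density is integrated with Simpson's rule and $k_{95}$ is obtained by bisection.

```python
import math


def t_pdf(x, nu):
    """Density of a standard Student's t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - 0.5 * (nu + 1) * math.log1p(x * x / nu))


def coverage(a, nu, steps=2000):
    """P(|Z| <= a) for Z following a unit-variance scaled t (assumes nu > 2).

    Z = T * sqrt((nu - 2) / nu), so |Z| <= a is |T| <= a * sqrt(nu / (nu - 2)).
    The integral is evaluated with Simpson's rule (steps must be even).
    """
    b = a * math.sqrt(nu / (nu - 2))
    h = b / steps
    total = t_pdf(0.0, nu) + t_pdf(b, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 2 * total * h / 3


def k95(nu, p=0.95):
    """Half-width k such that [-k, k] covers a fraction p of the scaled t."""
    lo, hi = 0.0, 10.0
    for _ in range(50):  # bisection on the monotonic coverage function
        mid = (lo + hi) / 2
        if coverage(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for nu in (3, 4, 6, 10, 30):
        print(nu, round(k95(nu), 3), round(coverage(1.96, nu), 4))
```

Comparing `coverage(1.96, nu)` with 0.95 for several values of `nu` gives a direct view of how far the fixed-width interval departs from nominal coverage as the tails get heavier.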

Abstract

This short study presents an opportunistic approach to a (more) reliable validation method for the average calibration of prediction uncertainty. Because variance-based calibration metrics (ZMS, NLL, RCE...) are quite sensitive to heavy tails in the uncertainty and error distributions, a shift is proposed to an interval-based metric, the Prediction Interval Coverage Probability (PICP). It is shown on a large ensemble of molecular properties datasets that (1) sets of z-scores are well represented by Student's $t(\nu)$ distributions, $\nu$ being the number of degrees of freedom; (2) accurate estimation of 95% prediction intervals can be obtained by the simple $2\sigma$ rule for $\nu>3$; and (3) the resulting PICPs are more quickly and reliably tested than variance-based calibration metrics. Overall, this method makes it possible to test 20% more datasets than ZMS testing. Conditional calibration is also assessed using the PICP approach.
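As an illustration of the interval-based test, the following sketch computes a PICP for intervals $[-1.96\,u_i, 1.96\,u_i]$ and checks whether its confidence interval overlaps a validity band of $0.95 \pm 0.005$. Several choices here are assumptions rather than the paper's stated procedure: the Wilson score interval for the binomial proportion, the synthetic data (unit-variance Student's $t$ z-scores with $\nu=6$), and all function names.

```python
import math
import random


def picp(errors, uncertainties, k=1.96):
    """Fraction of errors E_i that fall within [-k * u_i, k * u_i]."""
    hits = sum(1 for e, u in zip(errors, uncertainties) if abs(e) <= k * u)
    return hits / len(errors)


def wilson_interval(p_hat, n, z=1.96):
    """95% Wilson score interval for a proportion (assumed CI choice)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def is_compatible(ci, target=0.95, tol=0.005):
    """True if the confidence interval overlaps [target - tol, target + tol]."""
    lo, hi = ci
    return lo <= target + tol and hi >= target - tol


def sample_unit_t(rng, nu):
    """Draw one unit-variance Student's t variate (requires nu > 2)."""
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-squared with nu dof
    t = rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)
    return t * math.sqrt((nu - 2.0) / nu)


# Synthetic example: heteroscedastic errors with heavy-tailed z-scores.
rng = random.Random(0)
n = 20000
u = [rng.uniform(0.5, 2.0) for _ in range(n)]
e = [ui * sample_unit_t(rng, 6) for ui in u]

p = picp(e, u)
ci = wilson_interval(p, n)
print(f"PICP = {p:.4f}, CI = [{ci[0]:.4f}, {ci[1]:.4f}], "
      f"compatible = {is_compatible(ci)}")
```

The test reduces to counting hits and comparing one proportion with its target, which is why it does not depend on the second moment of a heavy-tailed sample the way a variance-based statistic does.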

Paper Structure (12 sections, 3 equations, 12 figures, 1 table)

Introduction
Interval-based average calibration testing
PICP and its validation
Coverage of $2\sigma$ intervals for $t_{s}(\nu)$ as a function of $\nu$
Simulations
Application
Shape of $Z$ distributions
Untestable datasets
PICP analysis
LCP analysis
Conclusions
Distributions of $Z^{2}$

Figures (12)

Figure 1: Enlargement factor and coverage probabilities for a $t_{s}(\nu)$ distribution as a function of $\nu$: (a) $k_{95}$; (b, c) coverage probability of a $[-a,a]$ interval for (b) $a=1.96$ and (c) $a=1,\,1.96,\,2.83$. The grayed area depicts a 0.005 deviation around the asymptotic value.
Figure 2: PICP$_{95}$ values for $t(\nu)$ samples: (left) effective coverage as a function of $\nu$; (right) same data as a function of the $Z^{2}$ skewness. The 95% confidence intervals on the PICP values are displayed as error bars. The cyan curve is the theoretical curve, as seen in Fig. 1(b). The gray area depicts the validity interval. The red points depict invalidated intervals that do not overlap the gray area.
Figure 3: PICP analysis at the $1\sigma$ (left) and $2\sigma$ (right) levels. The 95% confidence intervals on the PICP values are displayed as error bars. The points are color-coded into three classes: (1) gray for sets with $\beta_{GM}(Z^{2})\ge0.85$; (2) blue for calibrated sets; and (3) red for uncalibrated sets.
Figure 4: Local PICP (LCP) analysis using $N=20$ uncertainty-based equal-size bins. The 95% confidence intervals on the local PICP values are reported as error bars. The average PICP value is reported in the right margin. The gray area represents an admissible 0.005 deviation around the target value. Local values incompatible with this admissible area are colored in red. Gray points cannot be reliably tested.
Figure 5: Fig. 4, continued.
...and 7 more figures
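
The local analysis described in the Figure 4 caption (PICP evaluated in $N=20$ uncertainty-based equal-size bins) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the interval half-width of $1.96\,u_i$, the synthetic Gaussian data and the function name `local_picp` are all hypothetical choices.

```python
import random


def local_picp(errors, uncertainties, n_bins=20, k=1.96):
    """PICP within equal-size bins of points sorted by increasing uncertainty.

    Returns a list of (bin_size, local_coverage) tuples, one per bin.
    Requires at least n_bins data points.
    """
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    n = len(order)
    results = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        hits = sum(1 for i in idx if abs(errors[i]) <= k * uncertainties[i])
        results.append((len(idx), hits / len(idx)))
    return results


# Synthetic, well-calibrated example: E_i ~ N(0, u_i^2).
rng = random.Random(1)
n = 40000
u = [rng.uniform(0.5, 2.0) for _ in range(n)]
e = [rng.gauss(0.0, ui) for ui in u]

bins = local_picp(e, u)
for size, cov in bins:
    print(size, round(cov, 4))
```

Each local value is based on only $1/N$ of the data, so its confidence interval is wider than that of the global PICP. This should be kept in mind when judging compatibility with the target band.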
