Calibration in Machine Learning Uncertainty Quantification: beyond consistency to target adaptivity

Pascal Pernot

TL;DR

The paper addresses the problem that average calibration alone is insufficient for reliable uncertainty quantification (UQ) in ML regression, in particular when reliability is required across input feature space. It proposes a unified, z-score-based validation framework that assesses both consistency (calibration conditional on uncertainty) and adaptivity (calibration conditional on input features) using Local Z-Mean (LZM) and Local Z-Mean-Squares (LZMS) analyses and related metrics, with careful attention to binning and confidence intervals. The QM9 atomization energy example shows good average calibration but suboptimal adaptivity, which illustrates why adaptivity should be tested alongside consistency. The work provides actionable diagnostics (LZMS, the fraction of validated bins f_v,ZMS, bootstrap CIs, stratified binning) and an LRCE alternative to quantify local calibration, offering a structured path toward more reliable molecule- and material-specific uncertainty estimates in ML-UQ.
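
The three calibration levels named above can be written compactly as follows. This is a sketch that assumes the usual definition of a z-score as the prediction error $E_i$ divided by its predicted uncertainty $u_{E_i}$; the symbols $\sigma$ and $x$ are introduced here for illustration, while the targets 0 and 1 are the ones quoted in the figure captions.

$$
Z_i = \frac{E_i}{u_{E_i}}, \qquad \langle Z \rangle = 0, \quad \langle Z^{2} \rangle = 1 \quad \text{(average calibration)}
$$

$$
\langle Z^{2} \mid u_{E} = \sigma \rangle = 1 \ \ \forall \sigma \quad \text{(consistency)}, \qquad \langle Z^{2} \mid X = x \rangle = 1 \ \ \forall x \quad \text{(adaptivity)}
$$

In practice the conditional means are estimated within bins of $u_{E}$ or $X$, which is what the LZM and LZMS analyses do.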

Abstract

Reliable uncertainty quantification (UQ) in machine learning (ML) regression tasks is becoming the focus of many studies in materials and chemical science. It is now well understood that average calibration is insufficient, and most studies implement additional methods that test conditional calibration with respect to uncertainty, i.e. consistency. Consistency is mostly assessed with so-called reliability diagrams. There is, however, another way to go beyond average calibration: conditional calibration with respect to input features, i.e. adaptivity. In practice, adaptivity is the main concern of the end users of an ML-UQ method, who seek reliable predictions and uncertainties at any point in feature space. This article aims to show that consistency and adaptivity are complementary validation targets, and that good consistency does not imply good adaptivity. Adapted validation methods are proposed and illustrated on a representative example.
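
The claim that good consistency does not imply good adaptivity can be illustrated with a minimal sketch of an LZMS-style analysis on synthetic data. Everything in it is an assumption made for illustration and is not the author's implementation: the function names `local_zms` and `fraction_validated`, the plain percentile bootstrap, the 10 equal-size bins (the paper's figures use 100), and the toy data, in which the true error scale depends on a feature `x` that the predicted uncertainty `u` ignores.

```python
import numpy as np

def local_zms(z, cond, n_bins=10, n_boot=500, seed=0):
    """Mean of z**2 in equal-size bins of `cond`, with percentile-bootstrap 95% CIs.

    Returns an array of shape (n_bins, 3) with columns (estimate, ci_low, ci_high).
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(cond, kind="stable")      # sort points by conditioning variable
    rows = []
    for zb in np.array_split(z[order], n_bins):  # equal-size bins (up to one point)
        estimate = np.mean(zb**2)
        idx = rng.integers(0, zb.size, size=(n_boot, zb.size))  # resample with replacement
        boot = np.mean(zb[idx]**2, axis=1)
        ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
        rows.append((estimate, ci_low, ci_high))
    return np.array(rows)

def fraction_validated(stats, target=1.0):
    """Fraction of bins whose confidence interval contains the target value."""
    return float(np.mean((stats[:, 1] <= target) & (target <= stats[:, 2])))

# Toy data (hypothetical): the true error scale is u * sqrt(0.5 + x), so that
# <Z^2> = 1 on average and for any value of u, but <Z^2 | x> = 0.5 + x.
rng = np.random.default_rng(1)
M = 5000
x = rng.uniform(0.0, 1.0, M)        # input feature ignored by the UQ model
u = rng.uniform(0.5, 1.5, M)        # predicted uncertainty
errors = u * np.sqrt(0.5 + x) * rng.standard_normal(M)
z = errors / u                      # z-scores

stats_u = local_zms(z, u)           # consistency: condition on uncertainty
stats_x = local_zms(z, x)           # adaptivity: condition on the feature

print("mean Z^2:", np.mean(z**2))
print("f_v vs u:", fraction_validated(stats_u))
print("f_v vs x:", fraction_validated(stats_x))
```

Conditioning on `u` hides the dependence on `x` because the two are independent in this toy setup, whereas conditioning on `x` exposes bins whose intervals exclude the target of 1. The same functions applied to `z` instead of `z**2`, with a target of 0, would give the corresponding LZM-style check.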

Paper Structure (18 sections, 7 equations, 5 figures, 2 tables)

Introduction
Scope and limitations of the study
Structure of the article
Validation of variance-based UQ metrics
Average calibration
Individual, conditional and local calibration
Validation methods
Homoscedasticity plots of z-scores
Local calibration
Local Z-Mean and Z-Mean-Squares analysis
Validation metrics
Binning/grouping strategies
Effect of equal-size binning for stratified conditioning variables
Stratified binning
Choice of conditioning variables or groups for adaptivity
...and 3 more sections

Figures (5)

Figure 1: Flowchart of the z-scores-based validation framework.
Figure 2: QM9 dataset: z-scores vs. uncertainty (a), molecular mass (b) and the fraction of heteroatoms (c). Running statistics (mean ($\langle Z \rangle$) in red and mean squares ($\langle Z^{2} \rangle$) in orange) are estimated for a sliding window of size $M/100$.
Figure 3: QM9 validation dataset. Consistency and adaptivity validation plots based on 100 equal-size bins: LZM and LZMS analyses and ACF of LZMS vs. $u_{E}$ (a, d, g), molecular mass $X_{1}$ (b, e, h) and fraction of heteroatoms $X_{2}$ (c, f, i). For the LZM and LZMS analyses (a-f), the red symbols depict confidence intervals that do not contain the target statistic (0.0 for $\langle Z \rangle$; 1.0 for $\langle Z^{2} \rangle$), and the mean statistic (for the whole dataset) is reported in the right margin, with the same color code as for the local statistics. The corresponding $f_{v}$ statistics are reported in Fig. 4.
Figure 4: Fraction of validated bins for $\langle Z \rangle$ (left) and $\langle Z^{2} \rangle$ (right) according to three conditioning variables (a-c). The error bars depict 95% confidence intervals. The fractions should ideally be compatible with the 0.95 target (horizontal dashed line). The "Nominal" values (black circles) result from equal-size binning of the dataset with 100 bins, and the error bars are estimated from a binomial distribution. They summarize the LZM and LZMS analyses reported in Fig. 3. The "Random order" values (red squares) display the mean and 95% confidence interval for a random ordering of the dataset (based on 1000 permutations). The "Random + Binomial" values (orange diamonds) combine the binomial uncertainty with the previous values. The "Stratified" values (green triangles) are the statistics for the binning scheme based on the preservation of strata, with binomial uncertainty. They summarize the LZM and LZMS analyses reported in Fig. 5.
Figure 5: QM9 validation dataset. LZM and LZMS analyses vs. $u_{E}$ (a, d), molecular mass (b, e) and fraction of heteroatoms (c, f). The data have been aggregated to get a minimum of 100 points per stratum. The red symbols depict confidence intervals that do not contain the target statistic (0.0 for $\langle Z \rangle$; 1.0 for $\langle Z^{2} \rangle$). The mean statistic (over the whole dataset) is reported in the right margin, with the same color code as for the local statistics. The corresponding $f_{v}$ statistics are reported in Fig. 4.
