Calibration in Machine Learning Uncertainty Quantification: beyond consistency to target adaptivity
Pascal Pernot
TL;DR
The paper addresses the problem that average calibration alone is insufficient for reliable uncertainty quantification in ML regression, in particular when reliability is required across the input feature space. It proposes a unified validation framework based on z-scores to assess both consistency (calibration conditional on uncertainty) and adaptivity (calibration conditional on input features), using the Local Z-Mean and Local Z-Mean-Squares statistics (LZM and LZMS) and related metrics, with careful attention to binning and confidence intervals. The QM9 atomization energy example shows good average calibration but suboptimal adaptivity, which illustrates why adaptivity should be tested alongside consistency. The work provides actionable diagnostics (LZMS, f_v,ZMS, bootstrap CIs, stratified binning) and an LRCE-based alternative for quantifying local calibration, offering a structured route to more reliable molecule- and material-specific uncertainty estimates in ML-UQ.
Abstract
Reliable uncertainty quantification (UQ) in machine learning (ML) regression tasks is becoming the focus of many studies in materials and chemical science. It is now well understood that average calibration is insufficient, and most studies implement additional methods that test conditional calibration with respect to uncertainty, i.e., consistency. Consistency is assessed mostly by so-called reliability diagrams. There is, however, another way to go beyond average calibration: conditional calibration with respect to input features, i.e., adaptivity. In practice, adaptivity is the main concern of the end users of an ML-UQ method, who seek reliable predictions and uncertainties at any point in feature space. This article aims to show that consistency and adaptivity are complementary validation targets, and that good consistency does not imply good adaptivity. Suitable validation methods are proposed and illustrated on a representative example.
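The claim that good consistency does not imply good adaptivity can be illustrated with a small numerical sketch. The Python code below is not the author's implementation. It assumes z-scores defined as error divided by predicted uncertainty, uses a hypothetical helper `local_zms` with plain equal-count bins, and runs on synthetic data invented purely for illustration. It omits the bootstrap confidence intervals and stratified binning mentioned in the TL;DR. The synthetic errors are constructed so that the mean squared z-score is close to 1 overall and within bins of uncertainty, while it drifts strongly across bins of the input feature.

```python
import numpy as np


def local_zms(z, cond, n_bins=10):
    """Mean of squared z-scores within equal-count bins of `cond`.

    Hypothetical helper: values near 1 in every bin suggest calibration
    conditional on `cond`.
    """
    order = np.argsort(cond, kind="stable")
    bins = np.array_split(np.asarray(z)[order], n_bins)
    return np.array([np.mean(b ** 2) for b in bins])


# Synthetic data (an assumption, for illustration only).
rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(0.0, 1.0, n)   # input feature
u = rng.uniform(0.5, 2.0, n)   # predicted uncertainty, independent of x

# The true error scale depends on x through a factor whose squared mean is 1,
# so Var(E | u) = u**2 holds, but Var(E | x) does not match u**2.
scale = np.sqrt(0.5 + x)
errors = u * scale * rng.standard_normal(n)

z = errors / u                  # z-scores
zms = np.mean(z ** 2)           # average calibration: expected close to 1
zms_by_u = local_zms(z, u)      # consistency: binned by uncertainty
zms_by_x = local_zms(z, x)      # adaptivity: binned by input feature

print("ZMS (all data):", round(float(zms), 3))
print("ZMS by uncertainty bin:", np.round(zms_by_u, 2))
print("ZMS by feature bin:", np.round(zms_by_x, 2))
```

In this construction the per-bin values conditioned on `u` stay near 1, whereas those conditioned on `x` rise from roughly one half to roughly one and a half, so a test based on uncertainty alone would not reveal the miscalibration.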
