Another Fit Bites the Dust: Conformal Prediction as a Calibration Standard for Machine Learning in High-Energy Physics

Jack Y. Araz, Michael Spannowsky

TL;DR

The paper argues that conformal prediction provides a distribution-free, finite-sample calibration layer for diverse ML tasks in high-energy physics. It demonstrates how CP can convert arbitrary model outputs into calibrated prediction sets, p-values, or typicality regions across regression, binary and multiclass classification, anomaly detection, and generative modelling, using public collider datasets. CP yields rigorous uncertainty quantification without retraining or altering underlying models, enabling honest error control and robust comparisons. The authors advocate adopting CP as a standard post-processing step in collider ML pipelines to improve interpretability and decision-making under controlled error rates.

Abstract

Machine-learning techniques are essential in modern collider research, yet their probabilistic outputs often lack calibrated uncertainty estimates and finite-sample guarantees, limiting their direct use in statistical inference and decision-making. Conformal prediction (CP) provides a simple, distribution-free framework for calibrating arbitrary predictive models without retraining, yielding rigorous uncertainty quantification with finite-sample coverage guarantees under minimal exchangeability assumptions, without reliance on asymptotics, limit theorems, or Gaussian approximations. In this work, we investigate CP as a unifying calibration layer for machine-learning applications in high-energy physics. Using publicly available collider datasets and a diverse set of models, we show that a single conformal formalism can be applied across regression, binary and multi-class classification, anomaly detection, and generative modelling, converting raw model outputs into statistically valid prediction sets, typicality regions, and p-values with controlled false-positive rates. While conformal prediction does not improve raw model performance, it enforces honest uncertainty quantification and transparent error control. We argue that conformal calibration should be adopted as a standard component of machine-learning pipelines in collider physics, enabling reliable interpretation, robust comparisons, and principled statistical decisions in experimental and phenomenological analyses.
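
For reference, the finite-sample guarantee mentioned in the abstract can be written out for the split-conformal case. This is the standard textbook statement, not a formula quoted from the paper; the symbols $s_i$ (calibration scores), $n$ (calibration-sample size), and $s_{(k)}$ (the $k$-th smallest score) are notation assumed here for illustration.

$$
\hat{q}_{1-\alpha} = s_{(\lceil (n+1)(1-\alpha) \rceil)}, \qquad
\Gamma_\alpha(x) = \{\, y : s(x,y) \le \hat{q}_{1-\alpha} \,\}, \qquad
\mathbb{P}\big(Y_{n+1} \in \Gamma_\alpha(X_{n+1})\big) \ge 1-\alpha .
$$

Here $\hat{q}_{1-\alpha}=+\infty$ whenever $\lceil (n+1)(1-\alpha) \rceil > n$, and the probability is marginal over both the calibration sample and the test point, assuming only that the $n+1$ pairs are exchangeable. If the scores are almost surely distinct, the coverage is also bounded above by $1-\alpha+1/(n+1)$, so the sets are not arbitrarily conservative.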

Paper Structure

This paper contains 16 sections, 16 equations, 18 figures.

Figures (18)

  • Figure 1: Schematic overview of conformal prediction as a universal calibration layer for HEPML. A model-specific, uncalibrated output is combined with a calibration sample $\mathcal{D}_{\rm cal}=\{(X_i,Y_i)\}_{i=1}^{n_{\rm cal}}$ through a chosen nonconformity score $s(x,y)$ to determine the split-conformal threshold $\hat{q}_{1-\alpha}$. This results in calibrated objects with finite-sample marginal guarantees: prediction intervals in regression, label sets $\Gamma_\alpha(x)\subseteq\{1,\dots,K\}$ in classification, and calibrated anomaly or generative-model discrepancies that can be expressed as conformal $p$-values or, equivalently, as threshold scores.
  • Figure 2: Predictive intervals for a heteroscedastic synthetic regression task. The blue curve shows the ground-truth $f(x)=\sin(4\pi x)$ and blue points are test samples generated with input-dependent noise $\sigma(x)=0.1+0.6x$. Solid red curves denote the predictive mean, and the shaded orange regions represent prediction intervals. Panel (a) shows split conformal prediction applied to a Random Forest regressor. Panel (b) shows conformalised Quantile Regression (CQR) using two Gradient-Boosting quantile models. Panel (c) shows the effect of Gaussian Process (GP) intervals. Panel (d) shows conformal calibration of the GP using standardised residual scores. Panel (e) shows adaptive Conformal Prediction (ACP).
  • Figure 3: Panel (a) shows ROC curves for the Particle Flow Network (PFN), a PFN variant (PFIN; see the text for architecture), and the Minimal Basis for 8-subjettiness (MB8S). The top of panel (b) shows empirical coverage on the test sample as a function of nominal coverage $1-\alpha$ for the three base classifiers (PFN, PFIN, MB8S), while the bottom of panel (b) shows the mean prediction-set size $\mathbb{E}[\,|\Gamma_\alpha(x)|\,]$ versus $1-\alpha$.
  • Figure 4: Conditional coverage by jet mass for PFN at $\alpha=0.1$ using the score $S(x,y)=1-p_\theta(y\mid x)$. The red dashed line shows the nominal target $1-\alpha=0.9$ and hatched caps indicate $90\%$ binomial confidence intervals per bin.
  • Figure 5: Global conformal prediction performance for multi-class classification using the Omnilearn model. The blue curve shows empirical test-set coverage as a function of the nominal target $1-\alpha$, and the dashed black line shows the target coverage. The red curve shows the corresponding average prediction-set size $\mathbb{E}[|\Gamma_\alpha(x)|]$. Vertical dashed lines mark two representative confidence levels at $68\%$ and $95\%$ CL.
  • ...and 13 more figures
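
The calibration step described in the captions of Figures 1 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`conformal_threshold`, `prediction_sets`, `conformal_p_values`) are hypothetical, the data are synthetic Dirichlet probabilities rather than a collider dataset, and the score is the $1-p_\theta(y\mid x)$ choice quoted for Figure 4.

```python
import numpy as np


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: only the trivial set is valid.
        return np.inf
    return scores[k - 1]


def prediction_sets(probs, q_hat):
    """Boolean mask of shape (n_test, K): label k is kept when 1 - p_k(x) <= q_hat."""
    return (1.0 - np.asarray(probs)) <= q_hat


def conformal_p_values(cal_scores, test_scores):
    """Conformal p-value: (1 + #{calibration scores >= test score}) / (n + 1)."""
    cal = np.asarray(cal_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    return (1 + (cal[None, :] >= test[:, None]).sum(axis=1)) / (cal.size + 1)


# Toy demonstration on synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n_cal, n_test, K, alpha = 2000, 5000, 3, 0.1

# Stand-in for classifier outputs: random probability vectors, with labels
# drawn from those same probabilities so calibration and test are exchangeable.
probs = rng.dirichlet(np.ones(K), size=n_cal + n_test)
u = rng.random(n_cal + n_test)
labels = np.minimum((u[:, None] > probs.cumsum(axis=1)).sum(axis=1), K - 1)

# Nonconformity score of the true label on the calibration split.
cal_scores = 1.0 - probs[np.arange(n_cal), labels[:n_cal]]
q_hat = conformal_threshold(cal_scores, alpha)

# Label sets and their empirical coverage / mean size on the test split.
sets = prediction_sets(probs[n_cal:], q_hat)
coverage = sets[np.arange(n_test), labels[n_cal:]].mean()
mean_size = sets.sum(axis=1).mean()
print(f"coverage={coverage:.3f}, mean set size={mean_size:.2f}")
```

The threshold depends only on the calibration scores, so the same `conformal_threshold` call would apply to regression residuals or anomaly scores; only the score definition changes. The guarantee is marginal: it says nothing about coverage within a given jet-mass bin, which is why a conditional check such as the one in Figure 4 is a separate diagnostic.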