
Uncertainty Quantification of Surrogate Models using Conformal Prediction

Vignesh Gopakumar, Ander Gray, Joel Oskarsson, Lorenzo Zanisi, Daniel Giles, Matt J. Kusner, Stanislas Pamela, Marc Peter Deisenroth

TL;DR

This work develops a model-agnostic conformal prediction (CP) framework for quantifying uncertainty in data-driven spatio-temporal surrogates of complex physical systems. By calibrating cell-wise on tensor outputs, it delivers statistically valid marginal coverage across space and time at near-zero calibration cost, for both deterministic and probabilistic models, and even under out-of-distribution deployment, provided calibration and test data remain exchangeable. The authors evaluate CP with three nonconformity scores (CQR, AER, STD) on a wide suite of tasks (1D and 2D PDEs, Navier–Stokes and MHD plasmas, foundation physics models, and neural weather prediction), demonstrating valid coverage for outputs of up to tens of millions of dimensions. They also discuss exchangeability requirements and practical limitations (marginal rather than conditional coverage, independence across cells, and potential distribution shifts), and provide guidelines for using CP to validate pre-trained surrogates for safety-critical inference without retraining. Overall, CP emerges as a scalable, principled tool for trustworthy deployment of scientific ML models where reliable uncertainty quantification is essential but traditional UQ methods are computationally prohibitive.

Abstract

Data-driven surrogate models offer quick approximations to complex numerical and experimental systems but typically lack uncertainty quantification, limiting their reliability in safety-critical applications. While Bayesian methods provide uncertainty estimates, they offer no statistical guarantees and struggle with high-dimensional spatio-temporal problems due to computational costs. We present a conformal prediction (CP) framework that provides statistically guaranteed marginal coverage for surrogate models in a model-agnostic manner with near-zero computational cost. Our approach handles high-dimensional spatio-temporal outputs by performing cell-wise calibration while preserving the tensorial structure of predictions. Through extensive empirical evaluation across diverse applications including fluid dynamics, magnetohydrodynamics, weather forecasting, and fusion diagnostics, we demonstrate that CP achieves empirical coverage with valid error bars regardless of model architecture, training regime, or output dimensionality. We evaluate three nonconformity scores (conformalised quantile regression, absolute error residual, and standard deviation) for both deterministic and probabilistic models, showing that guaranteed coverage holds even for out-of-distribution predictions where models are deployed on physics regimes different from training data. Calibration requires only seconds to minutes on standard hardware. The framework enables rigorous validation of pre-trained surrogate models for downstream applications without retraining. While CP provides marginal rather than conditional coverage and assumes exchangeability between calibration and test data, our method circumvents the curse of dimensionality inherent in traditional uncertainty quantification approaches, offering a practical tool for trustworthy deployment of machine learning in physical sciences.
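The cell-wise calibration described in the abstract can be illustrated with a minimal NumPy sketch using the absolute error residual score. This is not the authors' implementation: the function names `calibrate_aer`, `prediction_set`, and `fake_surrogate`, the synthetic Gaussian data, and the 8x8 cell grid are all assumptions made for illustration. The quantile uses the standard split-conformal rank $\lceil (n+1)(1-\alpha) \rceil$ computed independently for every output cell.

```python
import numpy as np


def calibrate_aer(y_cal, pred_cal, alpha):
    """Return a cell-wise conformal quantile q_hat of shape cell_shape.

    y_cal, pred_cal: arrays of shape (n, *cell_shape), e.g. (n, time, x, y).
    """
    n = y_cal.shape[0]
    scores = np.abs(y_cal - pred_cal)  # AER nonconformity score per cell
    # Finite-sample rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        raise ValueError("not enough calibration samples for this alpha")
    # Sorting along the sample axis treats every cell independently.
    return np.sort(scores, axis=0)[k - 1]


def prediction_set(pred, q_hat):
    """Lower and upper bounds; q_hat broadcasts over the sample axis."""
    return pred - q_hat, pred + q_hat


# Hypothetical stand-in for a trained surrogate: truth plus cell-dependent noise.
rng = np.random.default_rng(0)
cell_shape = (8, 8)
noise_scale = np.linspace(0.1, 1.0, 64).reshape(cell_shape)


def fake_surrogate(y):
    return y + noise_scale * rng.standard_normal(y.shape)


y_cal = rng.standard_normal((500, *cell_shape))
y_test = rng.standard_normal((2000, *cell_shape))

q_hat = calibrate_aer(y_cal, fake_surrogate(y_cal), alpha=0.1)
lower, upper = prediction_set(fake_surrogate(y_test), q_hat)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage at alpha=0.1: {coverage:.3f}")
```

Because `q_hat` keeps the shape of a single output, the error bars retain the tensor layout of the prediction, and cells with noisier predictions receive wider intervals. The calibration step is a single sort over the sample axis, which is consistent with the low calibration cost reported in the abstract.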

Paper Structure (86 sections, 21 equations, 37 figures, 10 tables, 1 algorithm)

This paper contains 86 sections, 21 equations, 37 figures, 10 tables, 1 algorithm.

Figures (37)

  • Figure 1: Inductive CP framework using Absolute Error Residual (AER) nonconformity scores (see \ref{nonconformity scores}): (1) Calibrate by computing nonconformity scores ($\hat{s}$) from calibration predictions ($\tilde{y}_c$) and targets ($y_c$). (2) Estimate the quantile ($\hat{q}$) for desired coverage $(1-\alpha)$ using $n$ calibration samples and the inverse CDF $F_{\hat{s}}^{-1}$. (3) Construct prediction sets by applying $\hat{q}$ to test predictions ($\tilde{y}_p$).
  • Figure 2: Cell-wise uncertainty calibration using CP for a U-Net modelling the wave equation in an out-of-distribution setting (\ref{sec: wave-unet}). Rows show: ground truth, model prediction, uncalibrated 95% coverage from MC dropout ($2\sigma$), and calibrated 95% coverage ($\alpha=0.05$) from CP. Cell-wise calibration provides guaranteed coverage for each spatial location. MC dropout produces unrealistically small uncertainties, while CP error bars correctly identify regions of high uncertainty corresponding to complex dynamics.
  • Figure 3: Empirical coverage versus target coverage $(1-\alpha)$ across experiments and nonconformity scores. The diagonal represents ideal coverage. All methods achieve near-perfect coverage across four PDE experiments, validating the coverage guarantee in \ref{eq: coverage}.
  • Figure 4: Constructing exchangeable pairs for simulation versus experimental data. Simulation-based: Multiple independent runs with varying initial conditions at $t=0$, each forecasting the same time horizon to $t=T$. Experimental data: Single long-duration experiment partitioned into multiple IVPs with different starting times $T_i$ from a larger temporal domain, each forecasting the same duration $\Delta T$.
  • Figure 5: Calibrated prediction sets ($\alpha=0.1$, 90% coverage) for the 1D Poisson equation using three nonconformity scores (CQR, AER, STD). The simple dynamics enable near-perfect model fit, resulting in tight, well-calibrated error bars. Ground truth (black line), model prediction (blue line), and shaded regions depict the prediction sets.
  • ...and 32 more figures
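
The experimental-data construction described in the Figure 4 caption, where one long recording is partitioned into several initial value problems of equal forecast duration $\Delta T$, might be sketched as follows. The helper name `make_ivp_pairs`, its parameters `t_in`, `t_out`, and `stride`, and the default of non-overlapping windows are assumptions for illustration; the caption does not specify input lengths or whether windows may overlap.

```python
import numpy as np


def make_ivp_pairs(series, t_in, t_out, stride=None):
    """Split one long recording into (input, target) pairs.

    series: array of shape (T, *spatial_shape).
    t_in:   number of time steps given to the model as the initial state.
    t_out:  number of time steps to forecast (the duration Delta T).
    stride: gap between start times T_i; defaults to non-overlapping windows.
    """
    window = t_in + t_out
    if stride is None:
        stride = window  # assumption: windows do not share time steps
    starts = range(0, series.shape[0] - window + 1, stride)
    inputs = np.stack([series[s:s + t_in] for s in starts])
    targets = np.stack([series[s + t_in:s + window] for s in starts])
    return inputs, targets


# Toy recording: 20 time steps of a single-cell signal 0, 1, ..., 19.
series = np.arange(20.0).reshape(20, 1)
inputs, targets = make_ivp_pairs(series, t_in=2, t_out=3)
print(inputs.shape, targets.shape)
```

Each resulting pair shares the same forecast horizon, so the pairs can feed the calibration and test sets of a CP procedure. Whether such windows are exchangeable in practice depends on the data and is not something this sketch checks.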