Design-marginal calibration of Gaussian process predictive distributions: Bayesian and conformal approaches

Aurélien Pion, Emmanuel Vazquez

TL;DR

This work addresses the calibration of Gaussian process (GP) predictive distributions in the interpolation setting by introducing design-marginal notions of calibration (μ-calibration). It develops two methods: cps-gp, a conformal-prediction-based method that yields a conformal predictive distribution (CPD) for GP interpolation with distribution-free marginal calibration, and bcr-gp, a Bayesian post-processing approach that preserves the GP mean but recalibrates dispersion via a generalized normal residual model. The methods are compared against existing conformal techniques (Jackknife+ and full conformal GP) using μ-coverage, diagnostics based on the probability integral transform (PIT), and proper scoring rules, showing improved calibration and predictive distributions that remain usable for sequential design. The paper also provides theoretical results, finite-sample considerations, and practical guidance on parameter selection, design-size effects, and tail calibration, highlighting the trade-offs between model-based calibration and distribution-free guarantees. Overall, cps-gp and bcr-gp are complementary tools for improving the reliability of GP-based uncertainty quantification in design and optimization tasks.

Abstract

We study the calibration of Gaussian process (GP) predictive distributions in the interpolation setting from a design-marginal perspective. Conditioning on the data and averaging over a design measure μ, we formalize μ-coverage for central intervals and μ-probabilistic calibration through randomized probability integral transforms. We introduce two methods. cps-gp adapts conformal predictive systems to GP interpolation using standardized leave-one-out residuals, yielding stepwise predictive distributions with finite-sample marginal calibration. bcr-gp retains the GP posterior mean and replaces the Gaussian residual by a generalized normal model fitted to cross-validated standardized residuals. A Bayesian selection rule, based either on a posterior upper quantile of the variance for conservative prediction or on a cross-posterior Kolmogorov-Smirnov criterion for probabilistic calibration, controls dispersion and tail behavior while producing smooth predictive distributions suitable for sequential design. Numerical experiments on benchmark functions compare cps-gp, bcr-gp, Jackknife+ for GPs, and the full conformal Gaussian process, using calibration metrics (coverage, Kolmogorov-Smirnov, integral absolute error) and accuracy or sharpness through the scaled continuous ranked probability score.
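
Both methods start from standardized leave-one-out (LOO) residuals of the GP. The following is a minimal sketch of how such residuals could be computed, assuming a zero-mean GP with fixed hyperparameters and the standard closed-form LOO identities. The kernel helper `matern52`, the toy test function, the design, and the jitter value are illustrative assumptions, not taken from the paper. The last line fits a generalized normal distribution by plain maximum likelihood, which stands in for (and is not) the Bayesian selection rule described in the abstract.

```python
import numpy as np
from scipy.stats import gennorm


def matern52(a, b, lengthscale=0.2, variance=1.0):
    """Matérn 5/2 covariance between 1-D input arrays a and b (illustrative helper)."""
    s = np.sqrt(5.0) * np.abs(a[:, None] - b[None, :]) / lengthscale
    return variance * (1.0 + s + s**2 / 3.0) * np.exp(-s)


def loo_standardized_residuals(K, z):
    """Standardized LOO residuals for a zero-mean GP with covariance matrix K.

    Uses the closed-form identities
        z_i - m_{-i}(x_i) = [K^{-1} z]_i / [K^{-1}]_{ii},
        sigma_{-i}^2(x_i) = 1 / [K^{-1}]_{ii},
    so the standardized residual is [K^{-1} z]_i / sqrt([K^{-1}]_{ii}).
    """
    K_inv = np.linalg.inv(K)
    return (K_inv @ z) / np.sqrt(np.diag(K_inv))


# Toy design and observations (assumptions for illustration only).
x = np.linspace(0.0, 1.0, 20)
z = np.sin(6.0 * x) + 0.5 * x

# Small jitter on the diagonal for numerical stability.
K = matern52(x, x) + 1e-8 * np.eye(x.size)
r = loo_standardized_residuals(K, z)

# Plain maximum-likelihood fit of a generalized normal with location fixed at 0.
beta, loc, scale = gennorm.fit(r, floc=0.0)
```

A shape parameter `beta` equal to 2 corresponds to Gaussian residuals; values below 2 indicate heavier tails. In cps-gp the same residuals would instead define the jump locations of a stepwise predictive distribution.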

Paper Structure

This paper contains 82 sections, 14 theorems, 191 equations, 11 figures, 4 tables.

Key Result

Proposition 3.1

Let $V \sim \hat{F}_n(\cdot\mid x)$, $x\in\mathbb{X}$, be independent of $(\tau,\,\mathcal{D}_n)$, and set $U :=\hat{F}_{n,\,\tau}(V \mid x)$. Then, $U \mid \mathcal{D}_n\sim \mathcal{U}(0,1)$. Moreover, the half-open randomized interval satisfies

Figures (11)

  • Figure 1: Trade-off between predictive accuracy (RMSE) and calibration quality (KS--PIT, smaller is better, see Section \ref{sec:background}) for a uniform random sample of GP kernel hyperparameters (red points). We interpolate the Goldstein--Price function using a GP with constant mean and Matérn covariance. The left panel shows metrics computed by leave-one-out (LOO) on the observation set (150 points), and the right panel shows metrics computed on an independent test set drawn from $\mu$ (1500 points). The ML-selected hyperparameters (blue square) yield accurate but poorly calibrated predictions. Post-processed predictors using cps--gp (green symbol) and bcr--gp (gold star) improve calibration on the test set without degrading accuracy. The hatched region corresponds to RMSE--KS--PIT pairs that cannot be attained by any GP posterior under the considered GP family.
  • Figure 1: First row: PIT histograms with the uniform density (dashed line) as reference. A $\cup$-shaped PIT (mass near $0$ and $1$) indicates underdispersion; predictive intervals are too narrow and observations fall outside too often (optimistic coverage). A $\cap$-shaped PIT (mass near $1/2$) indicates overdispersion; intervals are too wide and observations fall inside too often (pessimistic coverage). Second row (rotated view): vertical axis is $z$. Bottom axis shows the predictive CDF $u=F_{\mathrm{pred}}(z)$ (shaded area), while top axis shows the density scale: horizontal empirical histogram (outline) and predictive pdf (dashed). A horizontal slice at a given $z$ maps to a CDF value $u$ on the bottom axis, which is precisely the PIT value contributing to the histogram in the first row. The empirical sample is drawn from the standard normal distribution $\mathcal{N}(0,1)$. Predictive distributions are normal with the same mean but different scales: $\mathcal{N}(0,0.5^{2})$ (underdispersed) and $\mathcal{N}(0,2^{2})$ (overdispersed).
  • Figure 1: Stepwise CPD (with $\tau=0.5$) compared to the Gaussian posterior CDF at $x_{n+1}=0$. The hyperparameters of the GP are fitted on $\mathcal{D}_n$ and then kept fixed. The CPD has discrete jumps at thresholds determined by GP interpolation, while the Gaussian posterior yields a smooth curve.
  • Figure 1: Empirical quantiles of standardized residuals $R = (Z - m_n(X))/\sigma_n(X)$ at $1000$ test points, compared with those of the standard normal and a fitted generalized normal distribution. Shaded bands indicate $99\%$ pointwise confidence intervals. Example: Goldstein--Price function in $d=2$, $n=40$.
  • Figure 1: Top left: Prediction intervals constructed from the GP posterior distribution and from bcr--gp at confidence level $1 - \alpha = 0.9$. bcr--gp uses the variance of the generalized normal distribution for selection, with $\delta = 0.1$. Top right: Predicted CDFs at $x = 0.75$, comparing the GP posterior, the CDF from bcr--gp (red), the stepwise CDF from cps--gp (black hatches), and an oracle CDF (black) obtained from a generalized normal model fitted on a test grid of $n_{\mathrm{test}} = 2000$ points. This dataset exhibits strong miscalibration of the GP posterior predictive distributions. Bottom: Interval bounds as a function of $1 - \alpha$. The GP posterior underestimates uncertainty across confidence levels, while bcr--gp and cps--gp produce larger intervals. cps--gp yields unbounded interval widths for $1 - \alpha \gtrsim 0.85$.
  • ...and 6 more figures
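
The PIT histogram caption above can be reproduced numerically. The sketch below draws a sample from $\mathcal{N}(0,1)$, evaluates the PIT under the predictive distributions $\mathcal{N}(0,0.5^{2})$, $\mathcal{N}(0,1)$ and $\mathcal{N}(0,2^{2})$, and summarizes each case with a 10-bin histogram and a Kolmogorov-Smirnov distance to the uniform distribution. The sample size, the seed, the bin count, and the helper name `pit_summary` are assumptions. Because the predictive distributions here are continuous, the plain PIT is used and no randomization is needed.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(0)
z = rng.standard_normal(5000)  # empirical sample from N(0, 1)


def pit_summary(z, pred_scale, bins=10):
    """Return the PIT histogram (as a density) and the KS distance to U(0, 1)."""
    pit = norm.cdf(z, loc=0.0, scale=pred_scale)  # u = F_pred(z)
    hist, _ = np.histogram(pit, bins=bins, range=(0.0, 1.0), density=True)
    ks = kstest(pit, "uniform").statistic
    return hist, ks


hist_under, ks_under = pit_summary(z, 0.5)  # underdispersed predictive
hist_ok, ks_ok = pit_summary(z, 1.0)        # correctly dispersed predictive
hist_over, ks_over = pit_summary(z, 2.0)    # overdispersed predictive

for name, ks in [("under", ks_under), ("ok", ks_ok), ("over", ks_over)]:
    print(f"{name:>5}: KS distance to uniform = {ks:.3f}")
```

The underdispersed case piles mass in the outer bins, the overdispersed case piles it in the central bins, and the KS distance is smallest when the predictive scale matches the sample.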

Theorems & Definitions (42)

  • Proposition 3.1: Boundary randomization and exact interval mass
  • Proof 1
  • Remark 3.2
  • Remark 3.3
  • Remark 3.4
  • Proposition 3.5
  • Proof 2
  • Remark 3.6: On reuse of observation points
  • Proposition 3.7: IAE bounded by KS--PIT
  • Proof 3
  • ...and 32 more