Evaluating Uncertainty in Deep Gaussian Processes

Matthijs van der Lende, Jeremias Lino Ferrao, Niclas Müller-Hof

TL;DR

This work evaluates deep Gaussian processes (DGPs) and deep sigma point processes (DSPPs) against Deep Ensembles for uncertainty quantification in regression and classification, focusing on calibration (NLL, ECE) and robustness under synthetic distribution shifts. DSPPs show strong in-distribution calibration via sigma-point quadrature, while ensembles exhibit superior robustness to shifts across tasks. DGPs lag in calibration or robustness depending on the dataset, underscoring that good in-distribution calibration does not guarantee shift resilience. The study provides a baseline, reveals trade-offs between calibration and robustness, and offers code to facilitate reproducibility and future benchmarking with broader likelihoods and domains.

Abstract

Reliable uncertainty estimates are crucial in modern machine learning. Deep Gaussian Processes (DGPs) and Deep Sigma Point Processes (DSPPs) extend GPs hierarchically, offering promising methods for uncertainty quantification grounded in Bayesian principles. However, their empirical calibration and robustness under distribution shift relative to baselines like Deep Ensembles remain understudied. This work evaluates these models on regression (CASP dataset) and classification (ESR dataset) tasks, assessing predictive performance (MAE, Accuracy), calibration using Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE), alongside robustness under various synthetic feature-level distribution shifts. Results indicate DSPPs provide strong in-distribution calibration leveraging their sigma point approximations. However, compared to Deep Ensembles, which demonstrated superior robustness in both performance and calibration under the tested shifts, the GP-based methods showed vulnerabilities, exhibiting particular sensitivity in the observed metrics. Our findings underscore ensembles as a robust baseline, suggesting that while deep GP methods offer good in-distribution calibration, their practical robustness under distribution shift requires careful evaluation. To facilitate reproducibility, we make our code available at https://github.com/matthjs/xai-gp.
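
As a rough illustration of the two calibration metrics named in the abstract, the sketch below computes a binned ECE and the mean NLL for a classifier. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the equal-width binning, and the default of 10 bins are all hypothetical choices made here, and the regression variant of ECE used for CASP may be defined differently in the paper.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for classification (illustrative; binning scheme is an assumption).

    probs:  (N, C) array of predicted class probabilities.
    labels: (N,) array of integer class labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)        # confidence of the predicted class
    predictions = probs.argmax(axis=1)     # predicted class index
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - mean confidence| in this bin, weighted by bin frequency
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


def mean_nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class under the predicted probabilities."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))
```

Both functions take the same `(probs, labels)` pair, so in-distribution and shifted test sets can be scored identically and the resulting values compared across models.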

Paper Structure

This paper contains 23 sections, 32 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Training and validation loss curves of our optimized models on CASP and ESR.
  • Figure 2: Calibration curves of our models for the Protein regression dataset.
  • Figure 3: Calibration curves of our models for the Epileptic Seizure Recognition dataset.
  • Figure 4: Box plots showing ECE under distributional shift for the CASP regression (left) and ESR classification (right) tasks across three methods: Deep Ensemble (blue), DGP (orange), and DSPP (green).
  • Figure 5: Negative log-likelihood on the test set against the number of inducing points. Left: CASP; right: ESR.
  • ...and 2 more figures