
Parameter uncertainties for imperfect surrogate models in the low-noise regime

Thomas D Swinburne, Danny Perez

TL;DR

This work analyzes the generalization error of misspecified, near-deterministic surrogate models, a regime of broad relevance in science and engineering, and shows posterior distributions must cover every training point to avoid a divergent generalization error.

Abstract

Bayesian regression determines model parameters by minimizing the expected loss, an upper bound to the true generalization error. However, the loss ignores misspecification, where models are imperfect. Parameter uncertainties from Bayesian regression are thus significantly underestimated and vanish in the large-data limit. This is particularly problematic when building models of low-noise, or near-deterministic, calculations, as the main source of uncertainty is neglected. We analyze the generalization error of misspecified, near-deterministic surrogate models, a regime of broad relevance in science and engineering. We show that posterior distributions must cover every training point to avoid a divergent generalization error, and we design an ansatz that respects this constraint, which for linear models incurs minimal overhead. This is demonstrated on model problems before application to thousand-dimensional datasets in atomistic machine learning. Our efficient misspecification-aware scheme gives accurate prediction and bounding of test errors where existing schemes fail, allowing this important source of uncertainty to be incorporated in computational workflows.
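The claim that Bayesian parameter uncertainties vanish in the large-data limit, even when the model is misspecified, can be illustrated with a minimal sketch. It is not the authors' code: it assumes a hypothetical engine `sin(3x)`, a quadratic surrogate, a fixed known noise variance, and a broad Gaussian prior (the names `features`, `bayes_linear_fit`, and `experiment` are illustrative only).

```python
import numpy as np

def features(x):
    # Quadratic polynomial features (P = 3): 1, x, x^2.
    return np.stack([np.ones_like(x), x, x**2], axis=1)

def bayes_linear_fit(X, y, noise_var, prior_var=1e6):
    # Conjugate Gaussian posterior for linear regression with a
    # N(0, prior_var * I) prior and known noise variance (an assumption).
    precision = X.T @ X / noise_var + np.eye(X.shape[1]) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def experiment(n, noise_var=1e-4, seed=0):
    # Hypothetical near-deterministic "simulation engine": sin(3x) plus tiny noise.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, np.sqrt(noise_var), n)
    mean, cov = bayes_linear_fit(features(x), y, noise_var)

    x_test = np.linspace(-1.0, 1.0, 201)
    X_test = features(x_test)
    # Standard deviation of the predicted mean due to parameter uncertainty.
    param_std = np.sqrt(np.einsum("ij,jk,ik->i", X_test, cov, X_test))
    # Actual error of the misspecified surrogate against the noiseless engine.
    rmse = np.sqrt(np.mean((X_test @ mean - np.sin(3.0 * x_test)) ** 2))
    return param_std.mean(), rmse

if __name__ == "__main__":
    for n in (100, 10_000):
        std, rmse = experiment(n)
        print(f"N={n:6d}  mean parameter std={std:.2e}  test RMSE={rmse:.2e}")
```

Under these assumptions the parameter-induced standard deviation shrinks roughly as $1/\sqrt{N}$, while the test error of the quadratic surrogate stays essentially fixed, since it is dominated by misspecification rather than noise.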

Paper Structure (26 sections, 36 equations, 8 figures, 1 table, 2 algorithms)


Figures (8)

  • Figure 1: Regression of a deterministic quadratic polynomial ($P=3$) model onto a sinusoidal "simulation engine", trained on $N=100$ points. Top left: mean and $3\sigma$ interval from Bayesian ridge regression. All other plots show mean and max/min range. Top: POPS-ensemble and POPS-hypercube ansatz $\pi^*_E$ and $\pi^*_\mathcal{H}$. Bottom: numerically optimized $\mathcal{G}_E$ for a uniformly weighted $N$-ensemble, regularized with $\Sigma_\mathcal{Y}=\sigma\Sigma^*_\mathcal{L}$, for $\sigma=1,1/3,1/6$. Lower $\sigma$ values gave numerical instabilities and are not presented (see the ensemble appendix).
  • Figure 2: Test errors of a misspecified linear surrogate model on a cubic simulation engine. Left: test error histogram at $P=20$, $N/P=100$ for the minimum loss model (black) and predictions from the POPS-ensemble $\pi^*_E$ (orange) and POPS-hypercube $\pi^*_\mathcal{H}$ (green) ansatz. MAE: mean absolute error relative to the minimum loss solution. EV: envelope violation, points lying outside of the max/min bound. Right: probability of envelope violation for the $\pi^*_\mathcal{H}$ ansatz with $P$ and $N/P$.
  • Figure 3: Fitting a qSNAP [wood2018extending] interatomic potential to a diverse tungsten dataset [karabin2020entropy]. Left: representative training configurations [montes2022training]. Center: test error histogram at $P=1596$, $N/P=81$ for the minimum loss model (black) and predictions from the POPS-hypercube $\pi^*_\mathcal{H}$ (green) ansatz. MAE: mean absolute error relative to the minimum loss solution. EV: envelope violation, points lying outside of the max/min bound. Right: probability of envelope violation for the $\pi^*_\mathcal{H}$ ansatz with $N/P$.
  • Figure 4: Fitting a SNAP [wood2018extending] interatomic potential to a NbMoTaW high-entropy alloy (HEA) training set [li2020complex]. Far left: representative HEA configurations. Left: test error histogram at $P=120$, $N/P=376$ for the minimum loss model (black) and predictions from the POPS-hypercube $\pi^*_\mathcal{H}$ (green) ansatz. MAE: mean absolute error relative to the minimum loss solution. EV: envelope violation, points lying outside of the max/min bound. Right: probability of envelope violation for the $\pi^*_\mathcal{H}$ ansatz with $N/P$. Far right: correlation of actual vs predicted error in the test set. See main text.
  • Figure 5: Fitting a linear graphlet model [tynes_graphlet_2024] to predict energies from the QM9 dataset [qm9_reference]. Left: representative small organic molecules. Center: test error histogram at $P=247$, $N/P=42$ for the minimum loss model (black) and predictions from the POPS-hypercube $\pi^*_\mathcal{H}$ (green) ansatz. MAE: mean absolute error relative to the minimum loss solution. EV: envelope violation, points lying outside of the max/min bound. Right: probability of envelope violation for the $\pi^*_\mathcal{H}$ ansatz with $N/P$.
  • ...and 3 more figures
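
The captions above report EV (envelope violation) as the points lying outside the max/min bound of the predictions. A minimal sketch of how such a fraction could be computed is given below. The function name `envelope_violation` and the toy numbers are hypothetical, and the ensemble here is hand-written rather than produced by the POPS ansatz.

```python
def envelope_violation(ensemble_preds, targets):
    """Fraction of targets lying outside the ensemble's [min, max] envelope.

    ensemble_preds: list of M prediction lists, each of length N (one per member).
    targets: list of N reference values.
    """
    violations = 0
    for j, y in enumerate(targets):
        # Predictions of all ensemble members for test point j.
        column = [member[j] for member in ensemble_preds]
        if y < min(column) or y > max(column):
            violations += 1
    return violations / len(targets)

# Toy data (illustrative only): three ensemble members, four test points.
preds = [
    [0.9, 2.1, 2.8, 4.2],
    [1.1, 1.9, 3.1, 3.9],
    [1.0, 2.0, 3.0, 4.0],
]
targets = [1.05, 2.5, 2.9, 3.8]

print(envelope_violation(preds, targets))  # 0.5: points 2 and 4 fall outside
```

An EV close to zero indicates that the max/min range bounds nearly all test errors, which is the behavior the histograms in the figures are meant to assess.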