Evaluation of uncertainty estimations for Gaussian process regression based machine learning interatomic potentials

Matthias Holzenkamp, Dongyu Lyu, Ulrich Kleinekathöfer, Peter Zaspel

TL;DR

The study evaluates uncertainty estimations for Gaussian process regression (GPR)-based machine learning interatomic potentials (MLIPs) using Coulomb and SOAP representations to predict molecular energies. It compares predictive uncertainty from the GPR standard deviation with ensemble-based uncertainties and assesses calibration through global calibration curves and extended reliability diagrams, as well as the impact on active learning via uncertainty sampling. The findings show that ensemble uncertainties are globally poorly calibrated, while the GPR standard deviation is globally better calibrated but exhibits local biases in high-uncertainty regions, limiting its use as a quantitative error interval. Uncertainty sampling in a fixed configuration space often worsens average performance, although high-uncertainty selections can improve extrapolation by pushing the model to cover border regions, highlighting a trade-off between data efficiency and representative coverage. Overall, the work provides a nuanced view of when GPR uncertainties are informative and how active-learning strategies should be designed for GPR-based MLIPs.

Abstract

Uncertainty estimations for machine learning interatomic potentials (MLIPs) are crucial for quantifying model error and identifying informative training samples in active learning strategies. In this study, we evaluate uncertainty estimations of Gaussian process regression (GPR)-based MLIPs, including the predictive GPR standard deviation and ensemble-based uncertainties. We do this in terms of calibration and in terms of impact on model performance in an active learning scheme. We consider GPR models with Coulomb and Smooth Overlap of Atomic Positions (SOAP) representations as inputs to predict potential energy surfaces and excitation energies of molecules. Regarding calibration, we find that ensemble-based uncertainty estimations already show poor global calibration (i.e., averaged over the whole test set). In contrast, the GPR standard deviation shows good global calibration, but when grouping predictions by their uncertainty, we observe a systematic bias for predictions with high uncertainty. Although an increasing uncertainty correlates with an increasing bias, the bias is not captured quantitatively by the uncertainty. Therefore, the GPR standard deviation can be useful to identify predictions with a high bias and error but, without further knowledge, should not be interpreted as a quantitative measure for a potential error range. Selecting the samples with the highest GPR standard deviation from a fixed configuration space leads to a model that overemphasizes the borders of the configuration space represented in the fixed dataset. This may result in worse performance in more densely sampled areas but better generalization for extrapolation tasks.
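To make the two calibration views in the abstract concrete, the following is a minimal sketch, not the authors' code. It assumes Gaussian predictive distributions and uses synthetic targets drawn from those distributions, so the uncertainties are perfectly calibrated by construction. The function names `calibration_curve` and `binned_error_stats`, the sample size, and the number of bins are illustrative assumptions.

```python
import math
import random

def calibration_curve(y_true, mean, std, alphas):
    """Observed fraction of targets inside the central alpha-interval of N(mean, std^2)."""
    # A target lies inside the central alpha-interval exactly when
    # erf(|y - mean| / (std * sqrt(2))) <= alpha (this equals 2 * Phi(|z|) - 1).
    levels = [math.erf(abs(y - m) / (s * math.sqrt(2.0)))
              for y, m, s in zip(y_true, mean, std)]
    return [sum(level <= a for level in levels) / len(levels) for a in alphas]

def binned_error_stats(y_true, mean, std, n_bins):
    """Group predictions into equidistant bins of the predicted std.

    Returns one tuple (bin_center, error_mean, error_std, count) per bin
    that holds at least two samples. Assumes the std values are not all equal.
    """
    lo, hi = min(std), max(std)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for y, m, s in zip(y_true, mean, std):
        idx = min(int((s - lo) / width), n_bins - 1)  # largest std goes to last bin
        bins[idx].append(y - m)  # signed error, so a bias shows up in the mean
    rows = []
    for i, errs in enumerate(bins):
        if len(errs) < 2:
            continue
        mu = sum(errs) / len(errs)
        sd = math.sqrt(sum((e - mu) ** 2 for e in errs) / (len(errs) - 1))
        rows.append((lo + (i + 0.5) * width, mu, sd, len(errs)))
    return rows

# Synthetic, perfectly calibrated data (hypothetical values for illustration).
rng = random.Random(0)
n = 20000
mean = [rng.uniform(-1.0, 1.0) for _ in range(n)]
std = [rng.uniform(0.1, 1.0) for _ in range(n)]
y_true = [rng.gauss(m, s) for m, s in zip(mean, std)]

alphas = [0.1 * k for k in range(1, 10)]
observed = calibration_curve(y_true, mean, std, alphas)
rows = binned_error_stats(y_true, mean, std, n_bins=5)

for a, o in zip(alphas, observed):
    print(f"alpha={a:.1f}  observed={o:.3f}")
for center, mu, sd, count in rows:
    print(f"std bin ~{center:.2f}: error mean={mu:+.3f}, error std={sd:.3f}, n={count}")
```

For calibrated uncertainties the observed proportions track `alpha`, the per-bin error mean stays near zero, and the per-bin error standard deviation stays near the bin's predicted standard deviation. A per-bin error mean that drifts away from zero as the predicted standard deviation grows would correspond to the kind of systematic bias described above, even when the global curve looks fine.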

Paper Structure

This paper contains 21 sections, 19 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: Extended reliability diagram for sine data with synthetic GPR test targets. To generate the data, a GPR model was trained with noisy data points from a sine function. Test targets were drawn from the predictive distributions of the GPR model. This yields data for which the GPR model perfectly captures the underlying data-generating process. For all test samples, the predicted GPR standard deviation is plotted against the actual error of the prediction. All predictions are separated by their standard deviation into equidistant bins. Mean and standard deviation of error distributions are shown for every bin.
  • Figure 2: Chemical structures of the molecules used in the tests.
  • Figure 3: Calibration curves of different uncertainty measures of GPR with SOAP for rMD17 benzene and WS22 SMA. Calibration curves plot the predicted proportion of test data expected to fall within $\alpha$-prediction intervals (horizontal axis) against the observed proportion of test data that actually falls within these intervals (vertical axis), iterating over different values of $\alpha$.
  • Figure 4: Calibration curves and extended reliability diagrams of the GPR standard deviation of GPR with Coulomb applied to different datasets. For the extended reliability diagrams, samples were separated into equidistant bins by their uncertainty, and the mean and standard deviation of the errors of all samples in one bin were calculated and are compared with the theoretical values from Eq. \ref{eq:error_dist}.
  • Figure 5: Uncertainty sampling results for GPR with Coulomb and SOAP representations for different datasets. We randomly selected 200 samples as initial training samples and 2000 samples as test samples. From the remaining pool of samples, in every iteration, the sample with the highest respective uncertainty was selected and added as an additional training sample. For comparison, we also ran a variant that adds a randomly chosen sample in every iteration and a variant that adds the sample with the highest actual absolute error. During each iteration, we compute the MAE.
  • ...and 11 more figures
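
The uncertainty sampling loop described in the Figure 5 caption can be sketched on a toy problem as follows. This is a hedged illustration, not the paper's implementation. It assumes a one-dimensional sine target in place of molecular energies, a zero-mean GPR with a fixed RBF kernel in place of Coulomb or SOAP inputs, hand-written Cholesky solves, and much smaller set sizes than the 200 initial and 2000 test samples in the caption. The names `gpr_predict` and `active_learning` and all hyperparameters are hypothetical, and the variant that selects by highest actual absolute error is omitted.

```python
import math
import random

def rbf(a, b, length=0.5):
    """RBF kernel with unit variance (length scale is an assumed, fixed value)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve_upper_t(L, b):
    """Solve L^T x = b by backward substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gpr_predict(x_train, y_train, x_query, noise=1e-4):
    """Zero-mean GPR posterior mean and standard deviation at the query points."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    L = cholesky(K)
    weights = solve_upper_t(L, solve_lower(L, y_train))  # (K + noise*I)^-1 y
    means, stds = [], []
    for xq in x_query:
        k = [rbf(xq, a) for a in x_train]
        v = solve_lower(L, k)
        means.append(sum(ki * wi for ki, wi in zip(k, weights)))
        var = rbf(xq, xq) - sum(vi * vi for vi in v)
        stds.append(math.sqrt(max(var, 0.0)))  # clip tiny negative round-off
    return means, stds

def active_learning(x_pool, f, x_test, n_init, n_iter, rng, strategy):
    """Grow the training set one sample per iteration and record the test MAE."""
    pool = list(x_pool)
    rng.shuffle(pool)
    train, pool = pool[:n_init], pool[n_init:]
    y_test = [f(x) for x in x_test]
    mae_history = []
    for step in range(n_iter + 1):
        y_train = [f(x) for x in train]
        pred, _ = gpr_predict(train, y_train, x_test)
        mae_history.append(sum(abs(p - y) for p, y in zip(pred, y_test)) / len(y_test))
        if step == n_iter:
            break
        if strategy == "uncertainty":
            # Pick the pool sample with the highest predictive standard deviation.
            _, std_pool = gpr_predict(train, y_train, pool)
            pick = max(range(len(pool)), key=lambda i: std_pool[i])
        else:
            pick = rng.randrange(len(pool))  # random baseline
        train.append(pool.pop(pick))
    return train, mae_history

rng = random.Random(0)
x_pool = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(100)]
x_test = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(50)]
train_us, mae_us = active_learning(x_pool, math.sin, x_test, 5, 20, random.Random(1), "uncertainty")
train_rand, mae_rand = active_learning(x_pool, math.sin, x_test, 5, 20, random.Random(1), "random")
print("uncertainty sampling MAE:", mae_us[0], "->", mae_us[-1])
print("random sampling MAE:     ", mae_rand[0], "->", mae_rand[-1])
```

Both runs start from the same initial training set because they use identically seeded generators, so any difference in the MAE curves comes from the selection rule alone. On a smooth toy function like this one the outcome need not resemble the paper's molecular results, so the sketch only shows the mechanics of the loop.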