Fast and Accurate Uncertainty Estimation in Chemical Machine Learning

Felix Musil, Michael J. Willatt, Mikhail A. Langovoy, Michele Ceriotti

TL;DR

This work presents a scalable framework for uncertainty estimation in chemical machine learning. It combines sparse Gaussian Process Regression (GPR) in the projected-process (PP) approximation, built on SOAP kernels, with resampling (sub-sampling) of the training set to generate an ensemble of predictions. The spread of the ensemble is calibrated by maximizing the log likelihood with respect to a scaling factor, which makes the estimate more reliable than the standard GP variance. The framework is validated on two benchmarks: 1H NMR chemical shieldings in molecular crystals and QM9 formation energies. In both cases the sub-sampling uncertainty, especially with non-linear scaling, matches or outperforms the GPR-based uncertainty at reduced cost, and it can be propagated to derived properties. The approach supports training-set optimization and active learning, and is readily adaptable to other ML schemes.

Abstract

We present a scheme to obtain an inexpensive and reliable estimate of the uncertainty associated with the predictions of a machine-learning model of atomic and molecular properties. The scheme is based on resampling, with multiple models being generated based on sub-sampling of the same training data. The accuracy of the uncertainty prediction can be benchmarked by maximum likelihood estimation, which can also be used to correct for correlations between resampled models, and to improve the performance of the uncertainty estimation by a cross-validation procedure. In the case of sparse Gaussian Process Regression models, this resampled estimator can be evaluated at negligible cost. We demonstrate the reliability of these estimates for the prediction of molecular energetics, and for the estimation of nuclear chemical shieldings in molecular crystals. Extension to estimate the uncertainty in energy differences, forces, or other correlated predictions is straightforward. This method can be easily applied to other machine learning schemes, and will be beneficial to make data-driven predictions more reliable, and to facilitate training-set optimization and active-learning strategies.
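The resampling and maximum-likelihood calibration described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: a closed-form ridge regression stands in for the sparse GPR model, the data are synthetic, and all function names (`fit_ridge`, `subsample_ensemble`, `ml_scaling`, and so on) are hypothetical. The sketch assumes Gaussian predictive distributions, for which scaling every variance by a single factor $\alpha^2$ gives the likelihood-maximizing value $\alpha^2 = \frac{1}{N}\sum_i \epsilon_i^2/\sigma_i^2$ on a validation set.

```python
import numpy as np


def fit_ridge(X, y, reg=1e-3):
    """Closed-form ridge weights (assumed stand-in for the sparse GPR model)."""
    A = X.T @ X + reg * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def subsample_ensemble(X, y, n_models=16, n_sub=None, reg=1e-3, seed=0):
    """Fit one model per random sub-sample (drawn without replacement)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_sub = n_sub or n // 2
    weights = []
    for _ in range(n_models):
        idx = rng.choice(n, size=n_sub, replace=False)
        weights.append(fit_ridge(X[idx], y[idx], reg))
    return np.stack(weights)  # shape: (n_models, n_features)


def predict_with_uncertainty(W, X):
    """Ensemble mean and (unscaled) ensemble variance for each input."""
    preds = X @ W.T  # shape: (n_points, n_models)
    return preds.mean(axis=1), preds.var(axis=1, ddof=1)


def ml_scaling(y_true, y_mean, var):
    """alpha^2 that maximizes the Gaussian log likelihood of alpha^2 * var."""
    return np.mean((y_true - y_mean) ** 2 / var)


def mean_log_likelihood(y_true, y_mean, var):
    """Average Gaussian log likelihood of the true values."""
    return np.mean(-0.5 * (np.log(2 * np.pi * var) + (y_true - y_mean) ** 2 / var))


# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=600)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:500], y[400:500]
X_test, y_test = X[500:], y[500:]

W = subsample_ensemble(X_train, y_train)

# Calibrate the variance scale on the validation set.
mu_val, var_val = predict_with_uncertainty(W, X_val)
alpha2 = ml_scaling(y_val, mu_val, var_val)
ll_val_raw = mean_log_likelihood(y_val, mu_val, var_val)
ll_val_scaled = mean_log_likelihood(y_val, mu_val, alpha2 * var_val)

# Apply the same scale to held-out predictions.
mu_test, var_test = predict_with_uncertainty(W, X_test)
ll_test_scaled = mean_log_likelihood(y_test, mu_test, alpha2 * var_test)
```

Because the sub-sampled models share training data, the raw ensemble spread is not expected to match the true error on its own; the single scaling factor is the simplest correction, and by construction it cannot lower the validation log likelihood relative to the unscaled variances.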

Paper Structure

This paper contains 15 sections, 19 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Log likelihood (LL) of predictions on the test set for different sub-sample sizes. After scaling the variances through maximum likelihood estimation -- internally (Int.) or on the validation set (CV) -- the final log likelihood is insensitive to the sub-sample size. A non-linear scaling of the uncertainty (N-L) further improves the uncertainty model.
  • Figure 1: Distribution of 1H chemical shielding predictions. The solid line shows the distribution of $P\left(\ln \epsilon_t\middle|\ln \sigma\right)$, while the dashed line shows the distribution of $P\left(\ln \epsilon_m\middle|\ln \sigma\right)$ (see Eq. \ref{eq:conditional}). The grayscale density plot corresponds to the marginal distribution of the predicted uncertainty $P(\ln \sigma)$.
  • Figure 2: Distribution of 1H chemical shielding predictions. The solid line shows the distribution of $P\left(\ln\epsilon_t\middle|\ln\sigma\right)$, while the dashed line shows the distribution of $P\left(\ln\epsilon_m\middle|\ln\sigma\right)$ (see Eq. \ref{eq:conditional}), including a non-linear scaling of the uncertainty corresponding to Eq. \ref{eq:rs-nonlinear-scaling}. The grayscale density plot corresponds to the marginal distribution of the predicted uncertainty $P(\ln\sigma)$.
  • Figure 2: Log likelihood (LL) of predictions on the test set for different sub-sample sizes. After scaling the variances through maximum likelihood estimation (internally or on the validation set), the final log likelihood is insensitive to the sub-sample size. A non-linear scaling of the uncertainty further improves the uncertainty model.
  • Figure 3: Distribution of formation energy differences. The solid line shows the distribution of $P\left(\ln \epsilon_t\middle|\ln \sigma\right)$, while the dashed line shows the distribution of $P\left(\ln \epsilon_m\middle|\ln \sigma\right)$ (see Eq. \ref{eq:conditional}). The grayscale density plot corresponds to the marginal distribution of the predicted uncertainty $P(\ln\sigma)$.