A Simple and Scalable Kernel Density Approach for Reliable Uncertainty Quantification in Atomistic Machine Learning

Daniel Willimetz, Lukáš Grajciar

TL;DR

A scalable, GPU-accelerated uncertainty quantification framework based on k-nearest-neighbor kernel density estimation (KDE) in a PCA-reduced descriptor space. It efficiently detects sparsely sampled regions in large, high-dimensional datasets and provides a transferable, model-agnostic uncertainty metric without the costly retraining of model ensembles.

Abstract

Machine learning models are increasingly used to predict material properties and accelerate atomistic simulations, but the reliability of their predictions depends on how well the training data represent the configurations being predicted. We present a scalable, GPU-accelerated uncertainty quantification framework based on $k$-nearest-neighbor kernel density estimation (KDE) in a PCA-reduced descriptor space. This method efficiently detects sparsely sampled regions in large, high-dimensional datasets and provides a transferable, model-agnostic uncertainty metric without the costly retraining of model ensembles. The framework is validated across diverse case studies that vary in: i) chemistry, ii) prediction models (including a foundational neural network), iii) descriptors used for the KDE, and iv) properties whose uncertainty is sought. In all cases, the KDE-based score reliably flags extrapolative configurations, correlates well with conventional ensemble-based uncertainties, and highlights regions of reduced prediction trustworthiness. The approach offers a practical route for improving the interpretability, robustness, and deployment readiness of ML models in materials science.
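
The pipeline described in the abstract (project descriptors with PCA, then estimate a density from the $k$ nearest training points) can be illustrated with a minimal CPU-only NumPy sketch. This is not the authors' implementation: the Gaussian kernel, the fixed `bandwidth`, the value of `k`, the brute-force neighbor search, the function names, and the synthetic descriptors are all assumptions made for illustration, and the GPU acceleration mentioned above is not reproduced here.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return (mean, components) of a PCA projection fitted on X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def knn_kde_density(train, query, k=10, bandwidth=1.0):
    """Gaussian-kernel density built from the k nearest training points.

    Assumption: kernel form, fixed bandwidth, and normalization by k are
    illustrative choices, not taken from the paper.
    """
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    # Keep the k smallest squared distances for each query point.
    knn_d2 = np.partition(d2, k - 1, axis=1)[:, :k]
    return np.exp(-knn_d2 / (2.0 * bandwidth**2)).sum(axis=1) / k

# Synthetic stand-in for atomic descriptors: 8 informative dimensions
# (unit variance) and 24 nearly constant ones.
rng = np.random.default_rng(0)
scales = np.full(32, 0.1)
scales[:8] = 1.0
train_desc = rng.normal(size=(500, 32)) * scales
inside = rng.normal(size=(5, 32)) * scales          # same distribution
outside = rng.normal(size=(5, 32)) * scales + 6.0   # far from the training set

mean, comps = fit_pca(train_desc, n_components=8)

def project(X):
    """Map descriptors into the PCA-reduced space."""
    return (X - mean) @ comps.T

rho_in = knn_kde_density(project(train_desc), project(inside))
rho_out = knn_kde_density(project(train_desc), project(outside))
```

In this toy setting the shifted points receive a far lower density than points drawn from the training distribution, which is the kind of signal a density-based score uses to flag extrapolation.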

Paper Structure

This paper contains 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: (a) Dependence of the KDE density evaluation time on the PCA-reduced dimension, starting from the original 256-dimensional atomic descriptor. The test is carried out for a fixed set of 228,000 atoms. (b) Scaling of the computational time with the number of atomic descriptors in the database for 1,000 snapshots of the silicatene–Pt$_6$ system (228,000 atoms in total, with a 256-dimensional descriptor per atom).
  • Figure 2: Comparison of the ensemble method (a) and the KDE method (b) for 500 ps MD simulations of Pt$_5$ clusters inside zeolite CHA at 750 K (left) and Pt$_6$ clusters on silicatene at 2000 K (right). The "mean", "minimum", and "maximum" data points are the mean, minimum, and maximum of the KDE densities evaluated over all atomic environments in each frame.
  • Figure 3: The minimum KDE density over all atomic environments for a 100 ps MD simulation of MFI zeolite at 350 K with one water molecule per aluminium for Si/Al ratios of 95 (blue) and 11 (red).
  • Figure 4: Comparison of the ensemble method (left) and the KDE method (right) on the rMD17 dataset. $\hat{F}$ denotes the reference forces obtained from DFT calculations, and $\sigma_F^2$ represents the variance of the predicted forces from the neural network potentials in the ensemble.
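
The quantities named in the captions above can be sketched in a few lines: the per-frame mean, minimum, and maximum of the per-atom KDE densities (Figures 2 and 3) and the ensemble force variance $\sigma_F^2$ (Figure 4). The function names, array layouts, and toy numbers below are hypothetical, and averaging the variance over the three Cartesian components is an assumption; the source does not state how the components are combined.

```python
import numpy as np

def frame_density_stats(rho):
    """rho: (n_frames, n_atoms) per-atom KDE densities.

    Returns the per-frame mean, minimum, and maximum density.
    """
    return rho.mean(axis=1), rho.min(axis=1), rho.max(axis=1)

def ensemble_force_variance(forces):
    """forces: (n_models, n_atoms, 3) force predictions from an ensemble.

    Assumption: the variance across models is computed per Cartesian
    component and then averaged over the three components.
    Returns an array of shape (n_atoms,).
    """
    return forces.var(axis=0).mean(axis=-1)

# Toy values for illustration only, not data from the paper.
rho = np.array([[0.9, 0.5, 0.7],
                [0.8, 0.1, 0.6]])
rho_mean, rho_min, rho_max = frame_density_stats(rho)

forces = np.array([[[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
                   [[1.0, 0.0, 0.0], [0.0, 4.0, 0.0]]])
sigma2 = ensemble_force_variance(forces)
```

The per-frame minimum is the most sensitive of the three statistics to a single poorly sampled atomic environment, since one low-density atom sets its value regardless of the rest of the frame.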