Enhanced sampling of robust molecular datasets with uncertainty-based collective variables

Aik Rui Tan, Johannes C. B. Dietschreit, Rafael Gomez-Bombarelli

TL;DR

This work tackles the data-coverage challenge for robust neural interatomic potentials by treating predictive uncertainty as a universal collective variable to drive enhanced sampling. It combines MACE-based NNIPs, GMM-derived uncertainty with conformal-prediction calibration, and the eABF-GaMD framework to bias simulations toward under-sampled, informative regions of configuration space, demonstrated on alanine dipeptide. Key contributions include: (1) introducing single-model, latent-space GMM uncertainty as a CV for enhanced sampling, (2) calibrating this uncertainty with CP to reflect true errors, (3) integrating uncertainty-guided eABF-GaMD to achieve wide phase-space coverage with limited training data, and (4) validating improved data efficiency and MLIP robustness against ensemble-based uncertainty and uncertainty-as-bias-energy approaches. The method enables exploration beyond predefined slow coordinates, producing diverse training data and more accurate PMFs with substantially fewer configurations, which has significant implications for accelerating reliable MLIP development.

Abstract

Generating a data set that is representative of the accessible configuration space of a molecular system is crucial for the robustness of machine-learned interatomic potentials (MLIPs). However, the complexity of molecular systems, characterized by intricate potential energy surfaces (PESs) with numerous local minima and energy barriers, presents a significant challenge. Traditional methods of data generation, such as random sampling or exhaustive exploration, are either intractable or may not capture rare but highly informative configurations. In this study, we propose a method that uses uncertainty as the collective variable (CV) to guide the acquisition of chemically relevant data points, focusing on regions of the configuration space where ML model predictions are most uncertain. This approach employs a Gaussian mixture model (GMM)-based uncertainty metric from a single model as the CV for biased molecular dynamics simulations. We demonstrate on the alanine dipeptide benchmark system that the approach overcomes energy barriers and explores unseen energy minima, thereby enhancing the data set in an active learning framework.
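The GMM-based uncertainty and its conformal-prediction (CP) calibration mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the uncertainty is the negative log-likelihood of a latent feature vector under a diagonal-covariance GMM whose parameters were already fitted on the training set, and that calibration is a single multiplicative factor taken as a conformal quantile of error-to-uncertainty ratios. The names `gmm_nll` and `conformal_scale`, the toy mixture parameters, and the choice of `alpha` are all hypothetical.

```python
import math


def gmm_nll(x, weights, means, variances):
    """Negative log-likelihood of feature vector x under a diagonal GMM.

    Assumption: weights, means and variances come from a GMM fitted
    beforehand on latent features of the training configurations.
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    # log-sum-exp over components for numerical stability
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


def conformal_scale(errors, uncertainties, alpha=0.05):
    """Scaling factor q such that q * uncertainty bounds the error for
    roughly a (1 - alpha) fraction of exchangeable calibration points.

    Assumption: uncertainties are strictly positive.
    """
    scores = sorted(e / u for e, u in zip(errors, uncertainties))
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1.0 - alpha)))
    return scores[k - 1]


# Toy two-component mixture in a 2D latent space (hypothetical values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = gmm_nll([0.1, -0.1], weights, means, variances)  # close to training data
far = gmm_nll([10.0, 10.0], weights, means, variances)  # far from training data

# Calibrate on a small hypothetical set of (true error, raw uncertainty) pairs.
scale = conformal_scale([0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], alpha=0.5)
print(near, far, scale * far)
```

In this sketch a configuration far from the training features receives a larger uncertainty than one close to them, which is the property that makes such a quantity usable as a CV for driving simulations toward under-sampled regions.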

Paper Structure (27 sections, 10 equations, 11 figures)

Figures (11)

  • Figure 1: (a) Structure of the alanine dipeptide molecule with carbon (C), nitrogen (N), oxygen (O), and hydrogen (H) atoms colored grey, blue, red, and white, respectively. The four backbone dihedral angles $\phi$, $\psi$, $\omega_1$, and $\omega_2$ are annotated. (b) Potential of mean force (PMF) profile of the N-acetyl-L-alanine-N'-methylamide (alanine dipeptide) molecule along the backbone dihedral angles $\phi$ and $\psi$, obtained by umbrella sampling with the Amber ff19SB force field (Tian et al., 2020). Regions with stable conformations are labeled (Mironov et al., 2019). (c) The top subfigure shows the Ramachandran plot of the 100 configurations provided as the initial data set to train the NNs in the first generation. The bottom subfigure shows the distribution of the same data set with respect to the $\omega_1$ and $\omega_2$ backbone dihedral angles. (d) The top and bottom subfigures show the distributions of the test data set over the $\phi$-$\psi$ and $\omega_1$-$\omega_2$ backbone dihedral angles, respectively. No NN from any generation is trained or validated on data from the test set.
  • Figure 2: Cumulative exploration of configuration space projected onto the $\phi$-$\psi$ (left column) and $\omega_1$-$\omega_2$ (right column) planes for 10 NVT simulations at 300 K. Top: no biasing; middle: uncertainty-guided eABF; bottom: uncertainty-guided eABF-GaMD.
  • Figure 3: (a) Hexbin plots of the data sets accumulated over generations of active learning, showing the distribution of the $\phi$-$\psi$ (top) and $\omega_1$-$\omega_2$ (bottom) backbone dihedral angles sampled during GMM-based uncertainty-guided eABF-GaMD simulations. (b) Total number of data points used for training the NNIPs in each generation. The initial value of 100 points at generation 1 is the number of initial training data shown in Fig. 1c. (c) Fraction of grid-based coverage of the $\phi$-$\psi$ and $\omega_1$-$\omega_2$ space in each generation. Horizontal dashed lines indicate the coverage of the test set. (d) and (e) Mean absolute error of the predicted energies and forces of configurations in the test set as generations proceed. Blue points indicate predictions from NNIPs based on the smaller MACE models with 4 channels, whereas orange points indicate predictions from the larger models with 16 channels. The two axes in each plot correspond to the units most commonly used for atomistic simulations with MLIPs.
  • Figure 4: (a) The top row shows the PMF profiles generated with the Amber ff19SB force field (ground truth) and with NNIPs from generations 1, 6, and 11, as a function of the $\phi$ and $\psi$ backbone dihedral angles. The middle row shows the differences between the NNIP-generated PMF contours and the ground-truth PMF. Red and blue regions indicate PMF overestimation and underestimation, respectively. The bottom row shows the absolute values of the PMF differences from the middle row, and the value in the top left of each plot is the mean absolute error (MAE) of the NNIP-generated PMF profile. All NNIPs used to generate the PMF contours are based on 16-channel MACE models. (b) MAEs of the PMF profiles generated with NNIPs from generations 1, 3, 6, 9, and 11, with error bars showing the range of MAEs.
  • Figure S1: Exploration of the "thermal coupling width" parameter, $\sigma$, for guiding eABF-GaMD simulations at 300 K using ensemble-based uncertainty. The top row displays the predicted potential energy, $V(\mathbf{x})$, alongside the potential energy including the eABF-GaMD bias, $V(\mathbf{x}) + V_\text{bias}(\mathbf{x})$, over simulation time. The bottom row shows the ensemble-predicted uncertainty, $\xi(\mathbf{x})$, and the fictitious variable, $\lambda$, over simulation time.
  • ...and 6 more figures
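
The grid-based coverage reported in Figure 3c can be read as the fraction of cells of a regular grid over a dihedral-angle plane that contain at least one sampled configuration. The sketch below illustrates that reading only. The bin count, the use of degrees in the range -180 to 180, and the function name `grid_coverage` are assumptions for illustration, since the caption does not specify the grid resolution used in the paper.

```python
def grid_coverage(angle_pairs, n_bins=36, lo=-180.0, hi=180.0):
    """Fraction of cells of an n_bins x n_bins grid visited by the samples.

    angle_pairs: iterable of (angle_1, angle_2) in degrees, e.g. (phi, psi).
    Angles are wrapped periodically into [lo, hi) before binning.
    """
    span = hi - lo
    width = span / n_bins
    visited = set()
    for a, b in angle_pairs:
        # min() guards against floating-point round-off at the upper edge
        i = min(int(((a - lo) % span) // width), n_bins - 1)
        j = min(int(((b - lo) % span) // width), n_bins - 1)
        visited.add((i, j))
    return len(visited) / (n_bins * n_bins)


# Hypothetical (phi, psi) samples in degrees.
samples = [(-180.0, -180.0), (-175.0, -175.0), (0.0, 0.0), (170.0, 170.0)]
coverage = grid_coverage(samples, n_bins=36)
print(f"coverage = {coverage:.4f}")
```

Computing the same quantity for the accumulated training set of each generation and for the test set would give curves and reference lines of the kind described in the caption, under the stated assumptions about the grid.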