Enhanced sampling of robust molecular datasets with uncertainty-based collective variables
Aik Rui Tan, Johannes C. B. Dietschreit, Rafael Gomez-Bombarelli
TL;DR
This work tackles the data-coverage problem for robust neural network interatomic potentials (NNIPs) by treating predictive uncertainty as a universal collective variable (CV) that drives enhanced sampling. It combines MACE-based NNIPs, a Gaussian mixture model (GMM) uncertainty calibrated with conformal prediction (CP), and the eABF-GaMD framework (extended-system adaptive biasing force combined with Gaussian-accelerated molecular dynamics) to bias simulations toward under-sampled, informative regions of configuration space, demonstrated on alanine dipeptide. Key contributions are: (1) introducing a single-model, latent-space GMM uncertainty as a CV for enhanced sampling; (2) calibrating this uncertainty with CP so that it reflects true errors; (3) integrating the uncertainty CV into eABF-GaMD to achieve wide configuration-space coverage from limited training data; and (4) benchmarking data efficiency and robustness of the resulting machine-learned interatomic potential (MLIP) against ensemble-based uncertainty and against approaches that use uncertainty directly as a bias energy. The method enables exploration beyond predefined slow coordinates, yielding diverse training data and more accurate potentials of mean force (PMFs) from substantially fewer configurations, which can accelerate the development of reliable MLIPs.
Abstract
Generating a data set that is representative of the accessible configuration space of a molecular system is crucial for the robustness of machine-learned interatomic potentials (MLIPs). However, the complexity of molecular systems, characterized by intricate potential energy surfaces (PESs) with numerous local minima and energy barriers, presents a significant challenge. Traditional methods of data generation, such as random sampling or exhaustive exploration, are either intractable or may not capture rare but highly informative configurations. In this study, we propose a method that uses uncertainty as the collective variable (CV) to guide the acquisition of chemically relevant data points, focusing on regions of configuration space where the ML model's predictions are most uncertain. The approach employs a Gaussian mixture model (GMM)-based uncertainty metric from a single model as the CV for biased molecular dynamics simulations. Its effectiveness in overcoming energy barriers and exploring unseen energy minima, thereby enhancing the data set in an active learning framework, is demonstrated on the alanine dipeptide benchmark system.
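The idea of a single-model GMM uncertainty can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the uncertainty is the negative log-likelihood of a latent feature vector under an already fitted, diagonal-covariance GMM, and it assumes a simple split-conformal calibration in which the uncertainty is multiplied by a quantile of the error-to-uncertainty ratios on a held-out calibration set. The function names `gmm_nll` and `conformal_scale`, the toy GMM parameters, and the toy feature vectors are all hypothetical.

```python
import math


def gmm_nll(x, weights, means, variances):
    """Negative log-likelihood of feature vector x under a diagonal GMM.

    Assumption: the GMM was fitted beforehand on latent features of the
    training set; here its parameters are simply passed in.
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log of (mixture weight * diagonal Gaussian density) for one component.
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    # Log-sum-exp over components for numerical stability.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


def conformal_scale(uncertainties, errors, alpha=0.05):
    """Scaling factor from split-conformal calibration (illustrative variant).

    Assumes strictly positive uncertainties. Returns the
    ceil((n + 1) * (1 - alpha))-th smallest error/uncertainty ratio,
    capped at the largest ratio for small calibration sets.
    """
    scores = sorted(e / u for e, u in zip(errors, uncertainties))
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1.0 - alpha)))
    return scores[k - 1]


# Toy two-component GMM in a 2-D latent space (hypothetical values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

u_seen = gmm_nll([0.0, 0.0], weights, means, variances)      # near training data
u_unseen = gmm_nll([10.0, 10.0], weights, means, variances)  # far from training data

# Calibrate on a tiny made-up calibration set, then rescale the uncertainty.
scale = conformal_scale([1.0, 2.0, 4.0, 5.0], [2.0, 2.0, 2.0, 2.0], alpha=0.5)
calibrated_unseen = scale * u_unseen
```

In this sketch a configuration far from the training features receives a larger uncertainty than one near a mixture component, which is the property that lets the uncertainty act as a CV: biasing the simulation toward high values pushes it into regions the model has not seen.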
