Active Learning for Machine Learning Driven Molecular Dynamics
Kevin Bachelor, Sanya Murdeshwar, Daniel Sabo, Razvan Marinescu
TL;DR
This work tackles the challenge of efficiently exploring biomolecular conformations with machine-learned coarse-grained (CG) potentials, which become unreliable in under-sampled regions of conformational space. It introduces an active-learning loop around the CGSchNet model that uses RMSD-based frame selection to identify coverage gaps, backmaps the selected frames to all-atom (AA) space, queries an AA oracle via OpenMM, and augments the CG training data with the resulting AA information. Empirically, this approach expands conformational exploration and yields a 33.05% improvement in Wasserstein-1 distance in TICA space on a Chignolin benchmark, with notable improvements in bond-length and bond-angle distributions and increased trajectory stability. The framework preserves CG efficiency while providing targeted, on-the-fly corrections to data coverage. It offers a practical path to more reliable ML-driven MD in drug discovery and related fields, and highlights avenues for improved backmapping and acquisition strategies.
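To make the RMSD-based frame selection step concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the acquisition rule, so this sketch assumes a frame is "novel" when its minimum superposed RMSD to every training frame is large, and picks the top-k such frames. The names `kabsch_rmsd` and `select_frames` are hypothetical.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (n_beads, 3) coordinate sets after optimal superposition."""
    # Remove translation by centering both structures.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Kabsch algorithm: SVD of the covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))


def select_frames(sim_frames, train_frames, k):
    """Return indices and scores of the k simulation frames farthest from the training set.

    ASSUMPTION: novelty score = minimum RMSD to any training frame.
    The actual selection criterion used in the paper may differ.
    """
    scores = np.array([
        min(kabsch_rmsd(frame, ref) for ref in train_frames)
        for frame in sim_frames
    ])
    # Sort by descending novelty and keep the first k indices.
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]
```

In a loop of the kind described above, the selected frames would then be backmapped to all-atom resolution and passed to the oracle; those steps are omitted here because the summary does not describe them in enough detail to sketch.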
Abstract
Machine-learned coarse-grained (CG) potentials are fast, but they degrade over time when simulations reach under-sampled biomolecular conformations, and generating enough all-atom (AA) data to cover these regions in advance is computationally infeasible. We propose a novel active learning (AL) framework for CG neural network potentials in molecular dynamics (MD). Building on the CGSchNet model, our method uses root-mean-square deviation (RMSD)-based frame selection from MD simulations to generate data on the fly by querying an oracle during the training of a neural network potential. This framework preserves CG-level efficiency while correcting the model at precise, RMSD-identified coverage gaps. We show empirically that our framework explores previously unseen configurations and trains the model on unexplored regions of conformational space. Our active learning framework enables a CGSchNet model trained on the Chignolin protein to achieve a 33.05% improvement in the Wasserstein-1 (W1) metric in Time-lagged Independent Component Analysis (TICA) space on an in-house benchmark suite.
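As a rough illustration of how a W1 comparison in TICA space might be computed, the following Python sketch evaluates the one-dimensional Wasserstein-1 distance per TICA component and averages the result. Several things here are assumptions rather than details from the abstract: that W1 is taken per component and averaged, that the inputs are already TICA-projected arrays of shape (n_frames, n_components), and that "improvement" means the relative reduction in W1 against a baseline model. The function names are hypothetical.

```python
import numpy as np


def w1_1d(x, y):
    """Wasserstein-1 distance between two 1D empirical samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    # Integrate |CDF_x - CDF_y| over the merged support.
    all_vals = np.sort(np.concatenate([x, y]))
    deltas = np.diff(all_vals)
    cdf_x = np.searchsorted(x, all_vals[:-1], side="right") / len(x)
    cdf_y = np.searchsorted(y, all_vals[:-1], side="right") / len(y)
    return float(np.sum(np.abs(cdf_x - cdf_y) * deltas))


def tica_w1(model_proj, reference_proj):
    """Mean per-component W1 between two TICA-projected trajectories.

    ASSUMPTION: per-component averaging; the benchmark may define this differently.
    """
    model_proj = np.asarray(model_proj, dtype=float)
    reference_proj = np.asarray(reference_proj, dtype=float)
    n_components = model_proj.shape[1]
    return float(np.mean([
        w1_1d(model_proj[:, i], reference_proj[:, i])
        for i in range(n_components)
    ]))


def relative_improvement(w1_baseline, w1_active):
    """Percentage reduction in W1 relative to the baseline (assumed definition)."""
    return 100.0 * (w1_baseline - w1_active) / w1_baseline
```

Under this assumed definition, a lower W1 for the actively trained model than for the baseline gives a positive percentage, which is the sense in which a figure such as 33.05% would be read.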
