Active Learning for Machine Learning Driven Molecular Dynamics

Kevin Bachelor, Sanya Murdeshwar, Daniel Sabo, Razvan Marinescu

TL;DR

This work addresses a weakness of machine-learned coarse-grained (CG) potentials: they become unreliable when simulations reach biomolecular conformations that the training data under-samples. It introduces an active-learning loop around the CGSchNet model that uses RMSD-based frame selection to identify coverage gaps, backmaps the selected frames to all-atom (AA) space, queries an AA oracle via OpenMM, and augments the CG training data with the resulting AA information. Empirically, this approach expands conformational exploration and yields a 33.05% improvement in Wasserstein-1 distance in TICA space on a Chignolin benchmark, with notable improvements in bond-length and bond-angle distributions and increased trajectory stability. The framework preserves CG efficiency while providing targeted, on-the-fly corrections to data coverage. It offers a practical path to more reliable ML-driven MD in drug discovery and related fields, and points to improved backmapping and acquisition strategies as future work.
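As a rough illustration of RMSD-based frame selection, the sketch below scores each simulated CG frame by its minimum RMSD (after optimal superposition with the Kabsch algorithm) to any training frame and returns the frames with the largest scores. The selection rule, the function names `kabsch_rmsd` and `select_least_covered`, and the parameter `k` are assumptions made for this example; the summary does not state the exact criterion the authors use.

```python
import numpy as np


def kabsch_rmsd(a, b):
    """RMSD between two (n_beads, 3) arrays after optimal rigid superposition."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Remove translation by centering both structures.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the 3x3 covariance matrix.
    u, _, vt = np.linalg.svd(a.T @ b)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = a @ rot.T - b
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def select_least_covered(sim_frames, train_frames, k):
    """Return indices and scores of the k simulated frames farthest from the training set.

    Assumed coverage proxy: minimum RMSD to any training frame (hypothetical rule).
    """
    scores = np.array(
        [min(kabsch_rmsd(f, t) for t in train_frames) for f in sim_frames]
    )
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]
```

The pairwise loop costs one SVD per (simulated frame, training frame) pair, so a real pipeline would likely subsample or batch these comparisons.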

Abstract

Machine-learned coarse-grained (CG) potentials are fast, but they degrade over time when simulations reach under-sampled biomolecular conformations, and generating enough all-atom (AA) data to cover those conformations broadly is computationally infeasible. We propose a novel active learning (AL) framework for CG neural network potentials in molecular dynamics (MD). Building on the CGSchNet model, our method employs root mean square deviation (RMSD)-based frame selection from MD simulations to generate data on-the-fly by querying an oracle during the training of a neural network potential. This framework preserves CG-level efficiency while correcting the model at precise, RMSD-identified coverage gaps. We show empirically that our framework explores previously unseen configurations and trains the model on unexplored regions of conformational space. Our active learning framework enables a CGSchNet model trained on the Chignolin protein to achieve a 33.05% improvement in the Wasserstein-1 (W1) metric in time-lagged independent component analysis (TICA) space on an in-house benchmark suite.
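To make the reported metric concrete, the following sketch computes an empirical W1 distance between two one-dimensional samples (for instance, projections onto a single TICA coordinate) and a relative improvement in percent. It rests on two assumptions: that the percentage is defined as (W1 before − W1 after) / W1 before, and that the distance is taken along one coordinate. The summary does not say how the authors aggregate over TICA components. The names `w1_1d` and `relative_improvement` are hypothetical.

```python
import numpy as np


def w1_1d(x, y, n_quantiles=1000):
    """Approximate Wasserstein-1 distance between two 1D samples.

    In one dimension, W1 equals the integral over q in (0, 1) of the absolute
    difference between the two quantile functions; here that integral is
    approximated with a midpoint rule on n_quantiles points.
    """
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return float(np.mean(np.abs(np.quantile(x, q) - np.quantile(y, q))))


def relative_improvement(w1_before, w1_after):
    """Percent reduction in W1 relative to the baseline (assumed definition)."""
    return 100.0 * (w1_before - w1_after) / w1_before
```

With this definition, a lower W1 after active learning gives a positive percentage, and a model whose samples match the reference distribution exactly would score 100%.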

Paper Structure

This paper contains 9 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Our active learning pipeline, showing the model-training $\Leftrightarrow$ data-generation loop. Each round trains a coarse-grained (CG) model, runs a CG simulation, selects the least-covered frames via an RMSD-based distance proxy, backmaps them to all-atom (AA) space, runs short OpenMM simulations, and projects AA$\to$CG to augment the dataset and retrain. By querying the oracle only where coverage is poor, the loop increases conformational coverage at minimal AA simulation cost.
  • Figure 2: Base model output (black), model output after active learning (brown), and training data per active learning iteration (purple $\rightarrow$ red), plotted as a histogram of RMSD values from the reference frame.
  • Figure A.1: Benchmark suite evaluating the base model's performance before applying the active learning loop.
  • Figure A.2: Benchmark suite evaluating the model's performance after applying the active learning loop. Note: Without a prohibitive amount of computation time, the benchmark cannot explore the entire space, so the first graph does not reach the full ground-truth distribution. However, the improvement is still visible: the post-AL benchmark shows a broader distribution, implying more exploration and coverage.
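The loop described in the Figure 1 caption can be sketched as plain control flow, with every stage passed in as a callable. This is only a skeleton under stated assumptions: `active_learning_loop` and all of its parameter names are hypothetical placeholders, not the authors' code, and the real stages (CGSchNet training, CG simulation, backmapping, the OpenMM oracle, AA-to-CG projection) are not implemented here.

```python
from typing import Callable, List, Tuple


def active_learning_loop(
    dataset: List,
    train: Callable,
    simulate_cg: Callable,
    select_frames: Callable,
    backmap: Callable,
    run_aa_oracle: Callable,
    project_to_cg: Callable,
    n_rounds: int,
) -> Tuple[object, List]:
    """Skeleton of the train <-> generate loop; every stage is user-supplied."""
    data = list(dataset)  # copy so the caller's dataset is not mutated
    model = train(data)
    for _ in range(n_rounds):
        cg_traj = simulate_cg(model)            # CG simulation with current model
        picked = select_frames(cg_traj, data)   # least-covered frames
        for frame in picked:
            aa_start = backmap(frame)           # CG frame -> AA structure
            aa_traj = run_aa_oracle(aa_start)   # short AA simulation (oracle)
            data.extend(project_to_cg(aa_traj)) # AA -> CG, augment dataset
        model = train(data)                     # retrain on augmented data
    return model, data
```

Passing the stages as arguments keeps the sketch independent of any particular simulation or training library, and lets each stage be swapped for a toy stand-in when testing the control flow.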