Enhanced Sampling for Efficient Learning of Coarse-Grained Machine Learning Potentials

Weilong Chen, Franz Görlich, Paul Fuchs, Julija Zavadlav

TL;DR

This work tackles the data inefficiency and poor transition-region sampling in learning coarse-grained machine learning potentials via force matching. It proves that mean forces are invariant under CG-coordinate bias when forces are recomputed with the unbiased potential, enabling biased data to be used without reweighting. By employing umbrella sampling and well-tempered metadynamics, the authors accelerate data generation and enrich transition-region sampling, demonstrated on Müller–Brown and capped alanine in water. The approach yields more accurate and stable CG PMFs without incorporating physics priors, highlighting enhanced sampling as a practical framework for data-efficient CG modeling and its potential for broader application.

Abstract

Coarse-graining (CG) enables molecular dynamics (MD) simulations of larger systems and longer timescales that are otherwise infeasible with atomistic models. Machine learning potentials (MLPs), with their capacity to capture many-body interactions, can provide accurate approximations of the potential of mean force (PMF) in CG models. Current CG MLPs are typically trained in a bottom-up manner via force matching, which in practice relies on configurations sampled from the unbiased equilibrium Boltzmann distribution to ensure thermodynamic consistency. This convention poses two key limitations: first, sufficiently long atomistic trajectories are needed to reach convergence; and second, even once equilibrated, transition regions remain poorly sampled. To address these issues, we employ enhanced sampling to bias along CG degrees of freedom for data generation, and then recompute the forces with respect to the unbiased potential. This strategy simultaneously shortens the simulation time required to produce equilibrated data and enriches sampling in transition regions, while preserving the correct PMF. We demonstrate its effectiveness on the Müller-Brown potential and capped alanine, achieving notable improvements. Our findings support the use of enhanced sampling for force matching as a promising direction to improve the accuracy and reliability of CG MLPs.
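
The claim that biasing along CG degrees of freedom preserves the PMF can be motivated with a short sketch. The symbols $u$ (atomistic potential), $W$ (bias), and $\xi$ (CG coordinates) follow the Figure 1 caption; $\beta = 1/k_BT$, $\mathbf{z}$ for a value of the CG coordinates, and $p$, $p_W$ for the unbiased and biased atomistic densities are notation introduced here for illustration and may differ from the paper's own derivation. Conditioning the biased density $p_W(\mathbf{r}) \propto e^{-\beta[u(\mathbf{r}) + W(\xi(\mathbf{r}))]}$ on a fixed CG configuration gives

$$
p_W(\mathbf{r} \mid \xi(\mathbf{r}) = \mathbf{z})
= \frac{e^{-\beta\left[u(\mathbf{r}) + W(\mathbf{z})\right]}\,\delta(\xi(\mathbf{r}) - \mathbf{z})}
       {\int e^{-\beta\left[u(\mathbf{r}') + W(\mathbf{z})\right]}\,\delta(\xi(\mathbf{r}') - \mathbf{z})\,\mathrm{d}\mathbf{r}'}
= \frac{e^{-\beta u(\mathbf{r})}\,\delta(\xi(\mathbf{r}) - \mathbf{z})}
       {\int e^{-\beta u(\mathbf{r}')}\,\delta(\xi(\mathbf{r}') - \mathbf{z})\,\mathrm{d}\mathbf{r}'}
= p(\mathbf{r} \mid \xi(\mathbf{r}) = \mathbf{z}).
$$

Because $W$ depends on $\mathbf{r}$ only through $\xi(\mathbf{r})$, the factor $e^{-\beta W(\mathbf{z})}$ is constant on the conditioning surface and cancels. Any conditional average at fixed $\mathbf{z}$, including that of the recomputed force $-\nabla_{\mathbf{r}} u$, is therefore the same in the biased and unbiased ensembles, while the marginal changes as $p_W(\mathbf{z}) \propto p(\mathbf{z})\,e^{-\beta W(\mathbf{z})}$. Since the force-matching minimizer is the conditional mean of the force, the bias changes where training data is concentrated but not the target being learned. The cancellation fails if the bias depends on coordinates that are not functions of $\xi$, in which case reweighting would be needed.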

Paper Structure

This paper contains 15 sections, 20 equations, 6 figures.

Figures (6)

  • Figure 1: Overview of the enhanced sampling force matching method. (A) Classical force matching: positions $\mathbf{r}$ and forces $\mathbf{f}$ from an unbiased atomistic MD simulation are used to learn the potential of mean force (PMF) $U$. (B) Enhanced sampling force matching (this work): Configurations are obtained via enhanced sampling, reducing the required simulation time (light red region). The forces acting during the biased simulation are $\hat{\mathbf{f}}_W(\mathbf{r}_W) = -\nabla_{\mathbf{r}}\!\left(u(\mathbf{r}_W) + W(\xi(\mathbf{r}_W))\right)$, while the unbiased forces used for training are recomputed using the unbiased potential, $\mathbf{f}_W(\mathbf{r}_W) = -\nabla_{\mathbf{r}} u(\mathbf{r}_W)$, which incurs minimal additional computational cost. The PMF is learned from the biased configurations $\mathbf{r}_W$ and their corresponding recomputed forces $\mathbf{f}_W$.
  • Figure 2: Finite data size effects in the low-dimensional Müller–Brown system. (A) Two-dimensional Müller–Brown potential energy surface (functional form given in the Supporting Information). (B) Exact free-energy profile along the $x$-axis. (C) Marginal probability density along the $x$-axis. (D) Instantaneous force samples from the unbiased dataset projected onto the $x$-axis, shown together with the exact mean force and bin-averaged estimates from biased and unbiased datasets of equal size. The unit of the force is $k_BT$.
  • Figure 3: Unbiased mean force recovery in the low-dimensional Müller–Brown system. In all panels, gray dots show instantaneous forces from the corresponding simulations. Overlaid curves denote the exact mean force and bin-averaged estimates: (A) Unbiased simulation. (B) Biased along $x$: bin-averaged estimates include both direct and recomputed (RC) mean forces. (C) Biased along $y$: bin-averaged estimates include direct, RC, and reweighted (RW) mean forces from importance sampling. (D) Biased along $(x,y)$: bin-averaged estimates include direct, RC, and RW mean forces.
  • Figure 4: Results for the low-dimensional Müller–Brown potential. (A) Root-mean-square error (RMSE) of predicted forces as a function of the number of training samples ($N$). RMSE values are computed relative to the exact mean force over 500 equally spaced points in the interval $x \in [20,45]$. Results are shown for models trained on biased datasets generated with umbrella sampling and on unbiased datasets. Error bars represent the standard deviation across five independently trained models with different random seeds. (B) Exact mean force compared with model-predicted forces trained on biased datasets obtained via umbrella sampling. $N$ indicates the number of training samples; uncertainties reflect variations across five independently trained models. (C) Same as (B), but using unbiased datasets for training.
  • Figure 5: Coarse-graining mapping of capped alanine and the resulting free energy profiles. (A) Mapping from the all-atom solvated model (left) to the coarse-grained (CG) model retaining the ten heavy backbone atoms (right). (B) Free-energy surfaces and one-dimensional dihedral distributions for datasets and CG model simulations. The left column (“Dataset”) shows the reference 2 µs unbiased MD free-energy surface and the well-tempered metadynamics (WT MetaD, 10 ns) dataset used for model training. The right columns (“Simulation with trained MACE MLPs”) show the corresponding free-energy surfaces and one-dimensional $\phi/\psi$ dihedral distributions obtained from CG simulations using models trained on the respective datasets. Mean values and standard deviations (shaded regions) are computed from 100 independent CG trajectories of 100 ns each.
  • ...and 1 more figure
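
The recomputed-force idea in the Figure 1 and Figure 3 captions can be illustrated numerically with a minimal sketch. It does not use the paper's Müller–Brown potential (whose functional form is in the Supporting Information) or an actual MD engine. Instead it assumes a hypothetical two-dimensional toy potential $u(x,y) = H(x^2-1)^2 + \tfrac{K}{2}(y - Ax)^2$, for which the exact mean force along $x$ is $-4Hx(x^2-1)$, and it draws exact samples from the biased and unbiased densities on a grid. All names, parameter values, and the choice of bias are assumptions made for illustration.

```python
import numpy as np

BETA = 1.0                 # 1 / (k_B T) in reduced units (assumed)
H, K, A = 5.0, 4.0, 1.0    # barrier height, coupling stiffness, coupling slope (hypothetical)
BIAS_SCALE = 0.8           # fraction of the barrier cancelled by the bias (hypothetical)

def potential_x(x):
    """x-dependent part of the toy potential; it also equals the PMF along x up to a constant."""
    return H * (x**2 - 1.0) ** 2

def exact_mean_force(x):
    """Exact mean force along x, i.e. minus the derivative of the PMF."""
    return -4.0 * H * x * (x**2 - 1.0)

def force_x(x, y):
    """Instantaneous force on x from the UNBIASED potential u(x, y)."""
    return exact_mean_force(x) + K * A * (y - A * x)

def bias(x):
    """Bias W(x) acting only on the CG coordinate x; it partially flattens the barrier."""
    return -BIAS_SCALE * potential_x(x)

def bias_force_x(x):
    """Force contribution of the bias, -dW/dx."""
    return BIAS_SCALE * 4.0 * H * x * (x**2 - 1.0)

def sample(n, rng, biased):
    """Draw exact samples: x from a gridded marginal, y from its Gaussian conditional."""
    grid = np.linspace(-1.6, 1.6, 3201)
    energy = potential_x(grid) + (bias(grid) if biased else 0.0)
    prob = np.exp(-BETA * (energy - energy.min()))
    prob /= prob.sum()
    x = rng.choice(grid, size=n, p=prob)
    y = A * x + rng.normal(0.0, 1.0 / np.sqrt(BETA * K), size=n)
    return x, y

def bin_average(x, values, edges):
    """Average `values` inside each x-bin defined by `edges`."""
    idx = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)
    sums = np.bincount(idx, weights=values, minlength=len(edges) - 1)
    counts = np.bincount(idx, minlength=len(edges) - 1)
    return sums / np.maximum(counts, 1)

rng = np.random.default_rng(0)
edges = np.linspace(-1.6, 1.6, 17)
n_samples = 200_000

x_b, y_b = sample(n_samples, rng, biased=True)    # enhanced-sampling stand-in
x_u, y_u = sample(n_samples, rng, biased=False)   # unbiased reference

f_rc = force_x(x_b, y_b)                 # recomputed with the unbiased potential
f_direct = f_rc + bias_force_x(x_b)      # forces that acted during the biased run

reference = bin_average(x_b, exact_mean_force(x_b), edges)
err_rc = np.max(np.abs(bin_average(x_b, f_rc, edges) - reference))
err_direct = np.max(np.abs(bin_average(x_b, f_direct, edges) - reference))

# Number of samples near the barrier top at x = 0 in each dataset.
n_ts_biased = int(np.sum(np.abs(x_b) < 0.2))
n_ts_unbiased = int(np.sum(np.abs(x_u) < 0.2))

print(f"max bin error, recomputed forces: {err_rc:.3f}")
print(f"max bin error, direct biased forces: {err_direct:.3f}")
print(f"barrier-region samples (biased / unbiased): {n_ts_biased} / {n_ts_unbiased}")
```

In this toy setting the bin-averaged recomputed forces track the exact mean force up to statistical noise, the direct forces that include the bias do not, and the biased dataset places far more samples near the barrier. A real workflow would replace the exact sampler with umbrella sampling or well-tempered metadynamics and the bin average with a trained CG model.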