Universally applicable and tunable graph-based coarse-graining for Machine learning force fields

Christoph Brunken, Sebastien Boyer, Mustafa Omar, Martin Maarand, Olivier Peltre, Solal Attias, Bakary N'tji Diallo, Anastasia Markina, Olaf Othersen, Oliver Bent

TL;DR

This paper tackles the challenge of creating a transferable coarse-grained ML force field (CG-MLFF) that generalizes across diverse biosystems (proteins, RNA, lipids). It introduces a MACE-based CG force field coupled with a tunable graph-based coarse-graining pipeline, trained on a fragmentation-derived dataset generated with semi-empirical references. A key contribution is the four-parameter tunable CG mapping, with coefficients $c_A$, $c_B$, $c_C$, and $c_D$, optimized by differential evolution to reduce force noise and improve training stability. While the tuned CG model often enhances training and qualitative MD behavior, MD stability is system-dependent; nonetheless, the results demonstrate the feasibility of a transferable CG-MLFF and outline a path toward including solvation and higher-accuracy references in future work. Overall, this work advances toward universally applicable CG force fields that can support large-scale biomolecular simulations.
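The coefficient search described above can be pictured with a minimal, standard-library sketch of differential evolution (the DE/rand/1/bin scheme). Everything specific here is an assumption made for illustration: `toy_force_noise` is a hypothetical quadratic stand-in for the force-noise metric, `TOY_OPTIMUM` is an invented target, and the bounds, population size, and mutation and crossover factors are arbitrary choices rather than the settings used in the paper.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, generations=300,
                           mutation=0.6, crossover=0.9, seed=0):
    """Minimise `objective` inside the box `bounds` using DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial population inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, none of them equal to i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # this coordinate always comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < crossover:
                    value = pop[a][d] + mutation * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy one-to-one selection
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


# Hypothetical target values for (c_A, c_B, c_C, c_D); not taken from the paper.
TOY_OPTIMUM = (0.3, -0.5, 0.8, 0.1)


def toy_force_noise(coeffs):
    """Stand-in objective: squared distance to TOY_OPTIMUM (assumption)."""
    return sum((c - t) ** 2 for c, t in zip(coeffs, TOY_OPTIMUM))


best_coeffs, best_noise = differential_evolution(toy_force_noise, [(-1.0, 1.0)] * 4)
print(best_coeffs, best_noise)
```

In a real setting the objective would run the coarse-graining with the candidate coefficients and return a noise estimate for the resulting CG forces. Each evaluation would then be expensive, which is one reason a gradient-free, population-based optimizer is a natural fit.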

Abstract

Coarse-grained (CG) force field methods for molecular systems are a crucial tool to simulate large biological macromolecules and are therefore essential for characterisations of biomolecular systems. While state-of-the-art deep learning (DL)-based models for all-atom force fields have improved immensely over recent years, we observe and analyse significant limitations of the currently available approaches for DL-based CG simulations. In this work, we present the first transferable DL-based CG force field approach (i.e., not specific to only one narrowly defined system type) applicable to a wide range of biosystems. To achieve this, our CG algorithm does not rely on hard-coded rules and is tuned to output coarse-grained systems optimised for minimal statistical noise in the ground truth CG forces, which results in significant improvement of model training. Our force field model is also the first CG variant that is based on the MACE architecture and is trained on a custom dataset created by a new approach based on the fragmentation of large biosystems covering protein, RNA and lipid chemistry. We demonstrate that our model can be applied in molecular dynamics simulations to obtain stable and qualitatively accurate trajectories for a variety of systems, while also discussing cases for which we observe limited reliability.

Paper Structure

This paper contains 20 sections, 16 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: The training of the CG-MLFF model is divided into two steps, where the coarse-graining is purely a pre-processing step that converts the all-atom training set to a coarse-grained training set.
  • Figure 2: Illustration of the tunable CG model approach. First, we build four chemical graphs that differ in their edge weighting schemes (weights represented by different colors). Second, the eigendecomposition of the graph Laplacians results in node priorities, depicted on the molecule in the centre (heavy atoms only). Finally, the aggregation algorithm is applied based on these priorities to determine the CG beads, whose positions are given by their centres of mass.
  • Figure 3: TICA plots for the tuned ML model compared to Martini for a 10 ns trajectory. For the TICA plot of the standard ML model, see Figure \ref{fig:tica_standard_model} in Appendix \ref{app:md_stability}.
  • Figure 4: Overlay of all sampled geometries for one example peptide fragment generated from PDB ID 1OGA. The geometries are separated by sampling method, namely md sampling (on the left), mdgauss sampling (in the center), and OpenBabel sampling (on the right). This illustration demonstrates the strengths and weaknesses of each sampling method, most notably that the MD-based methods sample the geometry more locally, while the stochastic conformer generation samples the potential energy surface in a more diverse manner.
  • Figure 5: Distribution of system sizes for all 4.9 million structures in our biosystems dataset. The largest fragment consists of 297 atoms.
  • ...and 7 more figures
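
The pipeline in the Figure 2 caption (weighted graph, Laplacian eigendecomposition, priority-based aggregation, centre-of-mass bead positions) can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the paper's algorithm: it uses one uniformly weighted graph instead of four weighting schemes, takes the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue) as the node priority, and substitutes plain sign-based spectral bisection for the aggregation step, which this summary does not specify. The six-atom chain, its coordinates, and its masses are hypothetical.

```python
def weighted_laplacian(n_nodes, weighted_edges):
    """Graph Laplacian L = D - W of an undirected weighted graph."""
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, w in weighted_edges:
        lap[i][j] -= w
        lap[j][i] -= w
        lap[i][i] += w
        lap[j][j] += w
    return lap


def fiedler_vector(lap, iterations=500):
    """Second-smallest eigenvector of L via power iteration on (shift*I - L)."""
    n = len(lap)
    # Gershgorin bound: every eigenvalue of L is at most twice the largest degree.
    shift = 2.0 * max(lap[i][i] for i in range(n))
    # Deterministic start vector; assumed not orthogonal to the Fiedler vector.
    vec = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(vec) / n
        vec = [v - mean for v in vec]  # project out the constant (eigenvalue 0) mode
        vec = [shift * vec[i] - sum(lap[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec


def spectral_bisection(priorities):
    """Stand-in aggregation (assumption): group nodes by the sign of their priority."""
    groups = [[i for i, p in enumerate(priorities) if p >= 0.0],
              [i for i, p in enumerate(priorities) if p < 0.0]]
    return sorted((g for g in groups if g), key=lambda g: g[0])


def centre_of_mass(group, positions, masses):
    """Mass-weighted mean position of the atoms in one bead."""
    total = sum(masses[i] for i in group)
    return tuple(sum(masses[i] * positions[i][k] for i in group) / total
                 for k in range(3))


# Hypothetical chain of six heavy atoms with unit bond weights.
edges = [(i, i + 1, 1.0) for i in range(5)]
positions = [(float(i), 0.0, 0.0) for i in range(6)]
masses = [12.0] * 6

priorities = fiedler_vector(weighted_laplacian(6, edges))
beads = spectral_bisection(priorities)
bead_positions = [centre_of_mass(g, positions, masses) for g in beads]
print(beads, bead_positions)
```

The overall sign of an eigenvector is arbitrary, so the sketch orders beads by their lowest atom index rather than by sign. A tunable scheme along the lines of the caption would presumably combine priorities from several differently weighted graphs before aggregating; how the paper does so is not stated in this summary.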