Machine Learning Coarse-Grained Potentials of Protein Thermodynamics

Maciej Majewski; Adrià Pérez; Philipp Thölke; Stefan Doerr; Nicholas E. Charron; Toni Giorgino; Brooke E. Husic; Cecilia Clementi; Frank Noé; Gianni De Fabritiis

TL;DR

The paper addresses the prediction of protein dynamics by learning thermodynamically consistent coarse-grained (CG) potentials: neural network potentials (NNPs) trained by force matching on a large all-atom MD dataset. It uses an alpha-carbon CG representation and builds a training set from approximately 9 ms of unbiased MD across twelve proteins with diverse secondary structures, training both protein-specific models and a single general multi-protein model. The CG simulations reproduce native and metastable states while accelerating dynamics by more than three orders of magnitude. The general model recovers the native structure for most targets and for mutated sequences, with limitations for beta-sheet proteins. The work demonstrates the potential of transferable ML CG potentials for simulating protein thermodynamics and dynamics, and it highlights the data demands and the current limitations in extrapolation and beta-sheet handling, pointing toward general-use CG force fields.
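To make the two ingredients named above concrete (an alpha-carbon mapping and a force-matching objective), here is a minimal sketch in plain Python. It is not the authors' implementation. The neural network potential is replaced by a toy harmonic chain so that forces can be written analytically, the mapping is assumed to be a simple selection of atoms named `CA`, and all function names, the spring constant `k`, and the rest length `r0` are hypothetical.

```python
import math

def select_ca(atom_names, coords):
    """Assumed CG mapping: keep only atoms named 'CA' (one bead per residue)."""
    return [xyz for name, xyz in zip(atom_names, coords) if name == "CA"]

def harmonic_chain_forces(beads, k, r0):
    """Forces -dU/dx for U = sum_i 0.5*k*(|x_{i+1} - x_i| - r0)^2.

    Toy stand-in for a learned potential; a real NNP would obtain the
    forces by differentiating the network output with respect to x.
    """
    forces = [[0.0, 0.0, 0.0] for _ in beads]
    for i in range(len(beads) - 1):
        d = [b - a for a, b in zip(beads[i], beads[i + 1])]
        r = math.sqrt(sum(c * c for c in d))
        scale = k * (r - r0) / r  # (dU/dr) / r
        for ax in range(3):
            forces[i][ax] += scale * d[ax]      # pulls bead i toward i+1 if stretched
            forces[i + 1][ax] -= scale * d[ax]  # equal and opposite on bead i+1
    return forces

def force_matching_loss(frames, ref_forces, k, r0):
    """Mean squared deviation between model forces and reference CG forces."""
    total, count = 0.0, 0
    for beads, ref in zip(frames, ref_forces):
        pred = harmonic_chain_forces(beads, k, r0)
        for f_pred, f_ref in zip(pred, ref):
            total += sum((p - q) ** 2 for p, q in zip(f_pred, f_ref))
            count += 3
    return total / count

# Toy three-residue "protein" with made-up coordinates (angstrom).
atom_names = ["N", "CA", "C", "N", "CA", "C", "N", "CA", "C"]
coords = [
    [-1.0, 0.5, 0.0], [0.0, 0.0, 0.0], [1.2, 0.6, 0.0],
    [2.8, 0.4, 0.0], [4.0, 0.0, 0.0], [4.5, 1.2, 0.0],
    [4.3, 2.4, 0.0], [4.0, 3.5, 0.0], [5.1, 4.2, 0.0],
]
beads = select_ca(atom_names, coords)

# Synthetic reference forces generated by the "true" parameters.
ref = harmonic_chain_forces(beads, k=10.0, r0=3.8)
loss_true = force_matching_loss([beads], [ref], k=10.0, r0=3.8)
loss_off = force_matching_loss([beads], [ref], k=5.0, r0=3.8)
```

In this toy setting the loss vanishes at the generating parameters and grows as `k` moves away from them. With real all-atom reference forces, the instantaneous mapped forces are noisy, so the loss is minimized but does not reach zero.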

Abstract

A generalized understanding of protein dynamics is an unsolved scientific problem, the solution of which is critical to the interpretation of the structure-function relationships that govern essential biological processes. Here, we approach this problem by constructing coarse-grained molecular potentials based on artificial neural networks and grounded in statistical mechanics. For training, we build a unique dataset of unbiased all-atom molecular dynamics simulations of approximately 9 ms for twelve different proteins with multiple secondary structure arrangements. The coarse-grained models are capable of accelerating the dynamics by more than three orders of magnitude while preserving the thermodynamics of the systems. Coarse-grained simulations identify relevant structural states in the ensemble with comparable energetics to the all-atom systems. Furthermore, we show that a single coarse-grained potential can integrate all twelve proteins and can capture experimental structural features of mutated proteins. These results indicate that machine learning coarse-grained potentials could provide a feasible approach to simulate and understand protein dynamics.

Paper Structure (1 section, 7 equations, 11 figures, 7 tables)

Supporting Information

Figures (11)

Figure 1: Structures obtained from CG simulations of the protein-specific model (orange) and the general multi-protein-trained model (blue), compared to their respective experimental structures (grey). Structures were sampled from the native macrostate, which was identified as the macrostate containing the conformation with the minimum RMSD with respect to the experimental crystal structure. Ten conformations were sampled from each conformational state (visualized as transparent shadows), and the lowest-RMSD conformation of each macrostate is displayed in cartoon representation, reconstructing the backbone structure from $\alpha$-carbon atoms. The native conformation of each protein, extracted from its corresponding crystal structure, is shown in opaque grey. The text indicates the protein name and PDB ID for the experimental structure. WW-Domain and NTL9 results for the general model are not shown, as the model failed to recover the experimental structures. The statistics of native macrostates are included in Table \ref{tab:Macrostate_stats}.
Figure 1: Starting points (red dots) of the coarse-grained molecular dynamics simulations, overlaid on the free energy surface across the first two TICA dimensions for each protein. The colorbar shows energy values in the range from 0 to 9 kcal/mol for Villin and $\alpha$3D, 0 to 6 kcal/mol for NTL9, and 0 to 7.5 kcal/mol for the remaining proteins.
Figure 2: Three individual CG trajectories selected from the validation MD of Trp-Cage, WW-Domain and Protein G. Each visualized simulation, coloured from purple to yellow, explores the free energy surface, accesses multiple major basins and transitions among conformations. Top panels: 100 states sampled uniformly from the trajectory, plotted over the CG free energy surface projected onto the first two time-lagged independent components (TICs) for Trp-Cage (a), WW-Domain (b) and Protein G (c). The red line outlines the all-atom equilibrium density at the energy level 7.5 kcal/mol above the free energy minimum. The experimental structure is marked as a red star. Bottom panels: C$\alpha$-RMSD of the trajectory with reference to the experimental structure for Trp-Cage (d), WW-Domain (e) and Protein G (f).
Figure 2: Comparison of the free energy surfaces across the first two TICA dimensions for the reference MD (left), the protein-specific model (center) and the general model (right) coarse-grained simulations of each protein. The free energy surface for each simulation set was obtained by binning over the first two TICA dimensions, dividing them into an 80×80 grid, and averaging the weights of the equilibrium probability in each bin computed by the Markov state model. The red line outlines the all-atom equilibrium density at an energy level above the free energy minimum of 9 kcal/mol for Villin and $\alpha$3D, 6 kcal/mol for NTL9, and 7.5 kcal/mol for the remaining proteins.
Figure 3: (a) Free energy surface of Protein G over the first two TICs for the all-atom MD simulations (top) and the coarse-grained simulations (bottom) using the protein-specific model. The circles identify different relevant minima (yellow: native, magenta: misfolded, cyan: partially folded, red: random coil). (b) The propensity of all the secondary structural elements of Protein G across the different macrostates, estimated using an RMSD threshold of 2 Å for each structural element shown on the x-axis. (c) Sampled conformations from the macrostates of coarse-grained simulations corresponding to the marked minima in the free energy surfaces in (a). Sampled structure colors correspond to the minima colors in the free energy surface plot, with blurry lines of the same color showing additional conformations from the same state. Arrows represent the main pathways leading from the random coil to the native structure, with the corresponding percentages of the total flux of each pathway.
...and 6 more figures
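
The free energy surface construction described in the caption comparing reference MD with the CG models (binning the first two TICA dimensions into an 80×80 grid and weighting by Markov state model equilibrium probabilities) can be sketched as follows. This is an illustration under assumptions, not the authors' code. Per-frame weights are summed in each bin as in a standard weighted histogram, which may differ in detail from the "averaging" the caption mentions. The function name, the `temperature` default, and the toy data are hypothetical.

```python
import math

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def free_energy_surface(tic1, tic2, weights, n_bins=80, temperature=300.0):
    """Return an n_bins x n_bins grid of free energies in kcal/mol.

    tic1, tic2 : per-frame projections on the first two TICA dimensions
    weights    : per-frame equilibrium weights (e.g. derived from an MSM)
    Empty bins are returned as None. The temperature default is an
    assumption for illustration only.
    """
    lo1, hi1 = min(tic1), max(tic1)
    lo2, hi2 = min(tic2), max(tic2)

    def bin_index(value, lo, hi):
        if hi == lo:
            return 0
        idx = int((value - lo) / (hi - lo) * n_bins)
        return min(idx, n_bins - 1)  # the maximum value goes into the last bin

    # Weighted 2D histogram: accumulate equilibrium weight per bin.
    prob = [[0.0] * n_bins for _ in range(n_bins)]
    for x, y, w in zip(tic1, tic2, weights):
        prob[bin_index(x, lo1, hi1)][bin_index(y, lo2, hi2)] += w

    # F = -kT ln(p / p_max), so the global minimum sits at 0 kcal/mol.
    p_max = max(max(row) for row in prob)
    kT = KB_KCAL_PER_MOL_K * temperature
    return [[-kT * math.log(p / p_max) if p > 0 else None for p in row]
            for row in prob]

# Toy usage with three frames and a 2x2 grid (made-up numbers).
toy_tic1 = [0.0, 0.1, 1.0]
toy_tic2 = [0.0, 0.1, 1.0]
toy_weights = [0.4, 0.4, 0.2]
fes = free_energy_surface(toy_tic1, toy_tic2, toy_weights, n_bins=2)
```

With the surface shifted so that its minimum is zero, a contour such as the 7.5 kcal/mol red line in the captions corresponds to the boundary of the bins whose free energy lies below that value.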
