MarS-FM: Generative Modeling of Molecular Dynamics via Markov State Models

Kacper Kapuśniak, Cristian Gabellini, Michael Bronstein, Prudencio Tossou, Francesco Di Giovanni

TL;DR

The paper tackles the high computational cost of all-atom MD by reframing generative modeling of protein dynamics through Markov State Models (MSMs). It introduces MSM Emulators, and in particular Markov Space Flow Matching (MarS-FM), which uses Flow Matching to learn to sample transitions between discrete MSM states, decoupling the generative model from fine-grained temporal dynamics. MarS-FM achieves large speedups over MD (more than 600× in some settings) and reproduces structural observables more accurately than MD-Emu baselines, on both tetrapeptides and large protein domains evaluated under strict sequence dissimilarity between training and test sets, while recovering MSM-like statistics. The approach offers a scalable way to generate diverse, thermodynamically consistent protein conformations, with potential impact on drug discovery and protein engineering. The authors also note limitations and avenues for future work, such as extending the method to complexes and to sequence-based initializations.

Abstract

Molecular Dynamics (MD) is a powerful computational microscope for probing protein functions. However, the need for fine-grained integration and the long timescales of biomolecular events make MD computationally expensive. To address this, several generative models have been proposed to generate surrogate trajectories at lower cost. Yet, these models typically learn a fixed-lag transition density, causing the training signal to be dominated by frequent but uninformative transitions. We introduce a new class of generative models, MSM Emulators, which instead learn to sample transitions across discrete states defined by an underlying Markov State Model (MSM). We instantiate this class with Markov Space Flow Matching (MarS-FM), whose sampling offers more than two orders of magnitude speedup compared to implicit- or explicit-solvent MD simulations. We benchmark MarS-FM's ability to reproduce MD statistics through structural observables such as RMSD, radius of gyration, and secondary structure content. Our evaluation spans protein domains (up to 500 residues) with significant chemical and structural diversity, including unfolding events, and enforces strict sequence dissimilarity between training and test sets to assess generalization. Across all metrics, MarS-FM outperforms existing methods, often by a substantial margin.
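
To make the abstract's point about fixed-lag training pairs concrete, here is a minimal sketch, not the paper's implementation, of how an MSM transition matrix can be estimated from a discretized trajectory by counting transitions at a fixed lag and row-normalizing. The 3-state chain `true_T`, the helper name `estimate_msm`, and the lag of one step are all illustrative assumptions. In this toy setting, the share of lag-1 pairs that stay in the same state shows why a fixed-lag objective sees mostly intra-state transitions.

```python
import numpy as np


def estimate_msm(dtraj, n_states, lag):
    """Hypothetical helper: count transitions at a fixed lag and row-normalize."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of unvisited states stay at zero instead of dividing by zero.
    T = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return counts, T


# Toy metastable chain (assumed for illustration): states rarely switch.
true_T = np.array([
    [0.98, 0.02, 0.00],
    [0.01, 0.98, 0.01],
    [0.00, 0.02, 0.98],
])

rng = np.random.default_rng(0)
n_steps = 20000
dtraj = [0]
for _ in range(n_steps):
    dtraj.append(rng.choice(3, p=true_T[dtraj[-1]]))
dtraj = np.array(dtraj)

counts, T = estimate_msm(dtraj, n_states=3, lag=1)

# Fraction of fixed-lag pairs that never leave their starting state.
self_fraction = np.trace(counts) / counts.sum()

print("Estimated transition matrix:\n", T.round(3))
print("Share of intra-state pairs:", round(float(self_fraction), 3))
```

Sampling states directly from an estimated `T` (or from its stationary distribution) rather than from consecutive frames is one simple way to picture the decoupling from temporal dynamics that the abstract describes; how MarS-FM actually constructs and uses its MSM is specified in the paper itself.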

Paper Structure

This paper contains 43 sections, 12 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison between existing approaches (MD-Emus) and our proposed novel class (MSM-Emus). MD-Emus learn transitions within a state (i.e., an energy minimum) well but can fail to generate transitions across different states (minima), since they are constrained by the data imbalance intrinsic to MD. Conversely, our framework learns to sample from the distribution induced by a Markov State Model (MSM). This modeling shift means that generative models are decoupled from temporal dynamics and can better learn to sample inter-state transitions. During sampling, MSM-Emus can generate conformations in parallel or be combined with existing MD-Emus to capture both large conformational changes and local dynamics within states.
  • Figure 2: Clustering of MD conformations of protein 3ma5A00 from MD-Cath. We report cluster centres and Markov chain transitions from one representative state. We note how MSM states capture large structural differences (folded vs unfolded).
  • Figure 3: Hierarchical sampling used for MarS-FM.
  • Figure 4: TICA plot for 3 random peptides in the test set, comparing MD ground-truth, MDGen, MarS (ours) and MarS + MDGen (ours). Our frameworks explore modes that are otherwise entirely ignored by MDGen. Similar plots are reported in the appendix (Additional Results).
  • Figure 5: First 4 samples generated by MDGen and MarS-FM for the domain 2ynmD03 in the test set. As MarS-FM interpolates among states independently of temporal dynamics, it can explore the energy landscape more efficiently. In fact, the secondary structure content varies significantly among these 4 samples (note that there is no ordering, as they are generated in parallel). Conversely, the MDGen samples all belong to the same energy minimum, which reduces sampling efficiency and exploration.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Definition 1