Flow Matching for Optimal Reaction Coordinates of Biomolecular System

Mingyuan Zhang, Zhicheng Zhang, Hao Wu, Yong Wang

TL;DR

The paper tackles the challenge of identifying optimal reaction coordinates (RCs) for reversible biomolecular dynamics without relying on pre-defined state labels or explicit transfer-operator eigenfunctions. It introduces Flow Matching for Reaction Coordinates (FMRC), which rephrases lumpability and decomposability as conditional-probability targets and implements them with an encoder and continuous normalizing-flow decoders trained via a simulation-free flow-matching objective. Across three biomolecular systems (CLN025, Trp-Cage, NTL-9), FMRC consistently preserves more of the system’s slow dynamics in a 2D RC space than VAC-based methods and exhibits lower training variance, while enabling deeper insights into metastable networks via MSMs and PCCA+ analyses. The study also demonstrates FMRC’s practical utility for enhanced sampling by bias deposition in Ala2, indicating potential for improving MSM construction and CV-based sampling with minimal prior knowledge and robust performance. Overall, FMRC offers a scalable, principled framework to obtain informative low-dimensional RCs that faithfully reflect the underlying transfer dynamics and facilitate downstream applications.

Abstract

We present flow matching for reaction coordinates (FMRC), a novel deep learning algorithm designed to identify optimal reaction coordinates (RCs) in reversible biomolecular dynamics. FMRC is based on the mathematical principles of lumpability and decomposability, which we reformulate into a conditional probability framework for efficient data-driven optimization using deep generative models. Although FMRC does not explicitly learn the well-established transfer operator or its eigenfunctions, it can effectively encode the dynamics of the leading eigenfunctions of the system's transfer operator into its low-dimensional RC space. We further quantitatively compare its performance with several state-of-the-art algorithms by evaluating the quality of Markov state models (MSMs) constructed in their respective RC spaces, demonstrating the superiority of FMRC in three increasingly complex biomolecular systems. In addition, we demonstrate the efficacy of FMRC for bias deposition in the enhanced sampling of a simple model system. Finally, we discuss its potential in downstream applications such as enhanced sampling methods and MSM construction.
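
The simulation-free flow-matching objective referred to in the TL;DR can be illustrated with a minimal sketch. It assumes the common linear interpolation path $x_s = (1-s)x_0 + s x_1$, whose target velocity is $x_1 - x_0$, and it omits any Gaussian smoothing of the path. The function names (`fm_training_example`, `fm_loss`), the toy vector fields, and the conditioning value `r` are hypothetical and are not taken from the paper's implementation.

```python
import random

def fm_training_example(x0, x1, s):
    """Return the interpolated point x_s and its target velocity.

    Assumes the linear path x_s = (1 - s) * x0 + s * x1, whose derivative
    with respect to s is the constant x1 - x0, so no ODE is simulated.
    """
    x_s = [(1.0 - s) * a + s * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]
    return x_s, u

def fm_loss(vector_field, pairs, r, rng):
    """Monte Carlo estimate of a conditional flow-matching loss.

    vector_field(x_s, s, r) is any callable returning a velocity. In an
    FMRC-like setup it would be a neural network conditioned on the RC
    value r (hypothetical signature, for illustration only).
    """
    total = 0.0
    for x0, x1 in pairs:
        s = rng.random()  # flow time s ~ U(0, 1)
        x_s, u = fm_training_example(x0, x1, s)
        v = vector_field(x_s, s, r)
        total += sum((vi - ui) ** 2 for vi, ui in zip(v, u))
    return total / len(pairs)

# Toy check with a deterministic coupling x1 = x0 + shift, chosen only so
# that the true velocity is known; real training pairs noise with data.
rng = random.Random(0)
shift = [1.0, -2.0]
pairs = []
for _ in range(100):
    x0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    x1 = [a + d for a, d in zip(x0, shift)]
    pairs.append((x0, x1))

def perfect_field(x_s, s, r):
    return shift  # equals the true velocity x1 - x0

def zero_field(x_s, s, r):
    return [0.0, 0.0]  # ignores the data entirely

loss_perfect = fm_loss(perfect_field, pairs, r=0.0, rng=rng)
loss_zero = fm_loss(zero_field, pairs, r=0.0, rng=rng)
```

In this toy setting the field that reproduces the true velocity has a loss of essentially zero, while the zero field has a loss close to the squared norm of the shift. The regression target is available in closed form for every sampled pair, which is what makes the objective simulation-free.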

Paper Structure (17 sections, 24 equations, 9 figures)

Figures (9)

  • Figure 1: 4D plots of an optimal RC that satisfies both lumpability (left) and decomposability (right) for a 2D model system. The x-axis and y-axis of the plot are the 2D coordinates of the system. The z-axis represents the relative free energy of each coordinate. The color bar represents the value of an optimal RC $\mathbf r = x_1$. The initial coordinate $\mathbf x_t$ of a transition with lag time $\tau$ is denoted in black and the final coordinate $\mathbf x_{t+\tau}$ is denoted in white. (left) An $\mathbf r(\mathbf x)$ isoline of the initial coordinates is shown as a blue dotted line. We further denote three initial coordinates on this isoline as $\mathbf x^*:=(x_1,x_2^*)$, $\mathbf x:=(x_1,x_2)$ and $\mathbf x^{**}:=(x_1,x_2^{**})$. Three Markovian transitions from these coordinates to a final coordinate $(x_1^*,x_2)$ are shown as black arrows, and the probabilities of these transitions are denoted as $p_\tau^{*}:=p_\tau(\mathbf x^*,\mathbf y)$, $p_\tau:=p_\tau(\mathbf x,\mathbf y)$ and $p_\tau^{**}:=p_\tau(\mathbf x^{**},\mathbf y)$ depending on their initial $x_2$ values. (right) An $\mathbf r(\mathbf y)$ isoline of the final coordinates is shown as a red dotted line. We further denote three final coordinates on this isoline as $\mathbf y^*:=(x_1^*,x_2^*)$, $\mathbf y:=(x_1^*,x_2)$ and $\mathbf y^{**}:=(x_1^*,x_2^{**})$. Three Markovian transitions from the initial coordinate $(x_1,x_2)$ to these coordinates are shown as backward black dotted arrows to emphasize that this is not a backward transition but the backward transition probability defined in equation 10. The probabilities of these transitions are denoted as $p_{-\tau}^{*}:=p_{-\tau}(\mathbf y^*,\mathbf x)$, $p_{-\tau}:=p_{-\tau}(\mathbf y,\mathbf x)$ and $p_{-\tau}^{**}:=p_{-\tau}(\mathbf y^{**},\mathbf x)$ depending on their final $x_2$ values.
  • Figure 2: The overall architecture of FMRC. Each time-lagged pair $\{\mathbf x_t, \mathbf x_{t+\tau}\}$ is pre-processed by TICA (optional) and encoded by the encoder. The L-decoder or D-decoder then optimizes the vector field $\hat{v}_{s,\boldsymbol \theta}(\mathbf x^i_{t+\tau,s},\mathbf r^{FMRC}_t)$ or $\hat{v}_{s,\boldsymbol \theta}(\mathbf x^i_{t,s},\mathbf r^{FMRC}_{t+\tau})$, conditioned (denoted as $\oplus$) on the latent variable $\mathbf r^{FMRC}(\mathbf x_t)$ or $\mathbf r^{FMRC}(\mathbf x_{t+\tau})$, of the linear-interpolation Gaussian path that transforms independent samples from $q_0^L$ or $q_0^D$ into independent samples from $q_1^L$ or $q_1^D$, respectively. Please refer to the main text for a detailed explanation.
  • Figure 3: Extensively sampled trajectories of three biomolecular systems used for FMRC performance evaluation and RC comparison in this study: (left) chignolin variant CLN025 (PDB ID: 2RVD), (middle) Trp-Cage (PDB ID: 2JOF) and (right) NTL-9 (PDB ID: 2HBA).
  • Figure 4: Comparison of $\hat{\lambda}_i^{MSM}$ of MSMs constructed in different RC spaces for CLN025. The red dashed line at $\lambda_i = 0.369$ marks the cutoff below which the corresponding timescale is shorter than the $\tau^{MSM}$ used for MSM construction. An eigenvalue below this line indicates that the constructed MSM has failed to identify that slow process.
  • Figure 5: (A) The 2D FES projection of CLN025 and (B) the PCCA+ macrostate assignment projection for CLN025 in the normalized $\mathbf r^{FMRC}$ space where the best MSM was constructed.
  • ...and 4 more figures
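
The eigenvalue cutoff in the Figure 4 caption can be related to timescales with a short sketch. It assumes the standard MSM implied-timescale relation $t_i = -\tau / \ln \lambda_i$ and that $\tau^{MSM}$ is the MSM lag time, under which $t_i < \tau$ exactly when $\lambda_i < e^{-1} \approx 0.368$. The function names `implied_timescale` and `is_resolved` are hypothetical helpers, not part of the paper.

```python
import math

def implied_timescale(eigenvalue, lag_time):
    """Implied timescale t_i = -lag_time / ln(lambda_i) for 0 < lambda_i < 1."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag_time / math.log(eigenvalue)

def is_resolved(eigenvalue, lag_time=1.0):
    """True if the process is slower than the lag time, i.e. lambda_i > exp(-1)."""
    return implied_timescale(eigenvalue, lag_time) > lag_time

# Eigenvalue at which the implied timescale equals the lag time.
cutoff = math.exp(-1.0)  # about 0.368, close to the 0.369 line in Figure 4

slow_ok = is_resolved(0.9)      # timescale is longer than the lag time
slow_lost = is_resolved(0.2)    # timescale is shorter than the lag time
```

The cutoff does not depend on the value of the lag time, because the lag time scales both sides of the comparison equally.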

Theorems & Definitions (2)

  • Remark 1
  • Remark 2