Graph Neural Network-State Predictive Information Bottleneck (GNN-SPIB) approach for learning molecular thermodynamics and kinetics

Ziyue Zou, Dedi Wang, Pratyush Tiwary

TL;DR

The paper introduces GNN-SPIB, a framework that jointly leverages graph neural networks and the State Predictive Information Bottleneck to learn latent reaction coordinates directly from atomic coordinates for enhanced sampling. By integrating a GNN head into SPIB, the method yields permutation-invariant, system-size-agnostic representations that reliably capture slow dynamics without hand-crafted features. Across LJ7, alanine dipeptide, and alanine tetrapeptide, the learned coordinates produce thermodynamic and kinetic estimates comparable to conventional expert-based CVs when used to bias metadynamics and infrequent metadynamics. This approach holds promise for applying enhanced sampling to complex systems where optimal reaction coordinates are unknown a priori, with potential extensions to higher-order representations and external data sources.

Abstract

Molecular dynamics simulations offer detailed insights into atomic motions but face timescale limitations. Enhanced sampling methods have addressed these challenges but even with machine learning, they often rely on pre-selected expert-based features. In this work, we present the Graph Neural Network-State Predictive Information Bottleneck (GNN-SPIB) framework, which combines graph neural networks and the State Predictive Information Bottleneck to automatically learn low-dimensional representations directly from atomic coordinates. Tested on three benchmark systems, our approach predicts essential structural, thermodynamic and kinetic information for slow processes, demonstrating robustness across diverse systems. The method shows promise for complex systems, enabling effective enhanced sampling without requiring pre-defined reaction coordinates or input features.
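
The idea of building a permutation-invariant, system-size-agnostic representation directly from atomic coordinates can be illustrated with a small sketch. This is not the GNN-SPIB architecture: the function names, the distance cutoff, and the hand-written "message" are hypothetical choices made only for illustration. Pairwise distances serve as edge features, each node sums messages from its neighbours, and a mean over nodes gives a fixed-size graph-level output.

```python
import math
import random


def pairwise_distances(coords):
    """Edge features: Euclidean distance between every pair of atoms."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def graph_embedding(coords, cutoff=2.0):
    """Toy 2-d graph-level embedding (hypothetical, not the paper's model)."""
    dist = pairwise_distances(coords)
    n = len(coords)
    node_feats = []
    for i in range(n):
        # Sum-aggregate a simple distance-based message over neighbours
        # that lie within the (assumed) cutoff radius.
        neigh = [dist[i][j] for j in range(n) if j != i and dist[i][j] < cutoff]
        degree = float(len(neigh))
        message = sum(math.exp(-d) for d in neigh)
        node_feats.append((degree, message))
    # Mean pooling over nodes removes any dependence on atom ordering and
    # keeps the output size fixed regardless of the number of atoms.
    z1 = sum(f[0] for f in node_feats) / n
    z2 = sum(f[1] for f in node_feats) / n
    return z1, z2


# Synthetic coordinates for a seven-particle cluster in 2-d.
rng = random.Random(0)
cluster = [(rng.uniform(0.0, 3.0), rng.uniform(0.0, 3.0)) for _ in range(7)]

shuffled = cluster[:]
rng.shuffle(shuffled)  # relabel the atoms

print(graph_embedding(cluster))
print(graph_embedding(shuffled))  # same values up to floating-point round-off
```

Because both the neighbour aggregation and the final pooling are symmetric functions of the atoms, relabeling the atoms leaves the output unchanged, and the same function accepts clusters of any size.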

Paper Structure (10 sections, 2 equations, 4 figures)


Figures (4)

  • Figure 1: Schematic of the workflow proposed in this work. Trajectories from unbiased or biased simulations are converted into time-series graph data. The batched large graph is fed into graph neural networks. The GNN-SPIB model is then trained to predict the state label of each time frame after a lag time $\Delta t$, as introduced in the original SPIB pipeline (box in black). The biasing variables (i.e., $z_1$ and $z_2$) are then used in enhanced sampling methods (box in red).
  • Figure 2: Summary of WTmetaD simulation results biasing along the machine-learned reaction coordinates, $z_1$ and $z_2$, in the Lennard-Jones 7 system. a) A schematic of how the 2-d reaction coordinate $\{z_1,z_2\}$, the output of the encoder, is computed from node features $\{V_n\}$ and edge features $\{L_e\}$, where $\oplus$ denotes the concatenation operation. b) State labels predicted by the model in RC space, projected along the training data collected at $k_BT=0.2\epsilon$. The highest contour line is at $10\epsilon$ and adjacent lines are separated by $2\epsilon$. c) Reweighted free energy surface of WTmetaD using $\{z_1,z_2\}$ at $k_BT=0.1\epsilon$, projected onto the expert-based CV space, $\mu_2^2$ and $\mu_3^3$, with state definitions in colored boxes. d) Box plots of free energy differences from c) between sampled metastable states, compared with conventional long MD and WTmetaD biasing expert-based CVs. e) Characteristic transition times of $c_0 \to c_3$ at $k_BT=0.1\epsilon$ estimated by imetaD simulations using expert-based and machine-learned RCs. The benchmark from standard MD simulation is drawn in cyan. The shaded region and error bars correspond to the 95% confidence interval. Marker colors indicate the $p$-value from the K-S test, where a $p$-value less than 0.05 suggests the result is unreliable.
  • Figure 3: Summary of WTmetaD simulation results for the alanine dipeptide system: a) representation of the alanine dipeptide molecule [HUMPHREY1996VMD] with the definition of the expert-based CVs, $\phi$ and $\psi$. The graph representation is constructed only from heavy atoms, and each atomic label is followed by the node index assigned in the graph; b) a schematic of how the reaction coordinate, $\{z_1,z_2\}$, is computed from node $\{V_n\}$ and edge $\{L_e\}$ features; c) state label predictions from the model decoder, shown in different colors, with contour lines separated by 3 kJ/mol; d) reweighted free energy surface from biasing the machine-learned RC at 300 K, projected onto $\{\phi,\psi\}$, with conformer definitions in boxes; e) free energy differences between the states defined in d) under different sampling schemes; and f) kinetic measurements of the transition from ($C7_{eq},C5$) to $C7_{ax}$ at 300 K with imetaD simulation. The dashed line in cyan is the benchmark MD simulation and marker points are results from imetaD using different RCs. The shaded region and error bars are the 95% confidence intervals. $p$-values from the K-S test on the imetaD simulations are reflected by the colors; when the $p$-value is less than 0.05 (in grey), the result is unreliable.
  • Figure 4: Summary of WTmetaD simulation results for the alanine tetrapeptide system: a) a schematic of the reaction coordinate construction, which combines the embeddings of each graph convolution layer via skip connections before the graph-level pooling operations; b) representation of the alanine tetrapeptide molecule [HUMPHREY1996VMD] with the definition of the characteristic dihedral angles, $\phi_1$, $\phi_2$, $\phi_3$, $\psi_1$, $\psi_2$, and $\psi_3$; only heavy atoms are used in graph construction; c) the learned latent variable space $\{z_1, z_2\}$ with state labels predicted by the model on the training data and the free energy surface with contours separated by 2 kJ/mol; d) reweighted free energy surface of WTmetaD simulations using the 2-d $\{z_1,z_2\}$ variables at 350 K, projected onto $\{\phi_1,\phi_2,\phi_3\}$ space; e) tabulated free energy differences between all conformers from brute-force MD simulations, WTmetaD simulations biasing $\{\phi_1,\phi_2,\phi_3\}$, and WTmetaD simulations biasing $\{z_1, z_2\}$; and f) characteristic transition times of $s_1 \to s_7$ measured by imetaD simulations using different variables at 400 K. The dashed line in cyan is the benchmark MD simulation and marker points are results from imetaD using different RCs. The shaded region and error bars are the 95% confidence intervals. $p$-values from the K-S test on the imetaD simulations are reflected by the colors; when the $p$-value is greater than 0.05, the estimate is reliable.
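
The captions above use a K-S test with a 0.05 $p$-value threshold to flag unreliable imetaD estimates. A minimal sketch of such a check follows, assuming (as is common for infrequent metadynamics, though not stated in the captions) that the transition times are compared against an exponential distribution with characteristic time $\tau$. The function names are hypothetical, $\tau$ is simply set to the sample mean, and the $p$-value uses the asymptotic Kolmogorov series with a finite-sample correction, so this is a simplified stand-in rather than the authors' exact procedure.

```python
import math
import random


def ks_statistic_exponential(times, tau):
    """Largest gap between the empirical CDF and 1 - exp(-t / tau)."""
    xs = sorted(times)
    n = len(xs)
    d = 0.0
    for i, t in enumerate(xs):
        cdf = 1.0 - math.exp(-t / tau)
        # Compare the model CDF with the empirical step just before and at t.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d


def kolmogorov_pvalue(d, n, terms=100):
    """Asymptotic two-sided K-S p-value with a finite-n correction."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        # The alternating series converges poorly here; p is essentially 1.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def ks_reliability(times):
    """Return (tau, p_value) for a set of transition times.

    Assumption: tau is estimated as the sample mean. Fitting tau on the same
    data makes the p-value optimistic, which is acceptable for a sketch.
    """
    tau = sum(times) / len(times)
    d = ks_statistic_exponential(times, tau)
    return tau, kolmogorov_pvalue(d, len(times))


# Synthetic data only: exponentially distributed times versus times that
# cluster tightly around one value (not Poisson-like).
rng = random.Random(0)
poisson_like = [rng.expovariate(1.0 / 5.0) for _ in range(200)]
clustered = [5.0 + rng.uniform(-0.5, 0.5) for _ in range(200)]

for name, sample in (("poisson_like", poisson_like), ("clustered", clustered)):
    tau, p = ks_reliability(sample)
    verdict = "reliable" if p > 0.05 else "unreliable"
    print(f"{name}: tau = {tau:.2f}, p = {p:.3g} ({verdict})")
```

In a real imetaD analysis the input would be the rescaled transition times from independent biased runs, and a two-sample variant of the test or a library routine may be preferable to this hand-rolled version.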