Using pretrained graph neural networks with token mixers as geometric featurizers for conformational dynamics

Zihan Pengmei, Chatipat Lorpaiboon, Spencer C. Guo, Jonathan Weare, Aaron R. Dinner

TL;DR

This work addresses the challenge of identifying informative, low-dimensional representations of molecular conformational dynamics without extensive feature engineering. It introduces geom2vec, a framework that pretrains equivariant graph neural networks on a denoising objective to learn transferable geometric embeddings, then uses these fixed representations as inputs to downstream dynamics tasks such as VAMPnets and SPIB. By decoupling GNN pretraining from task-specific training and coarse-graining the atomic representations into structural units (tokens) that are combined by token mixers, geom2vec enables all-atom analyses of larger systems with limited computational resources, while improving interpretability through attention maps and learned collective variables (CVs). The results on chignolin, trp-cage, and villin demonstrate improved slow-mode discovery and metastable-state identification, with practical implications for robust, scalable analysis of biomolecular dynamics.

Abstract

Identifying informative low-dimensional features that characterize dynamics in molecular simulations remains a challenge, often requiring extensive manual tuning and system-specific knowledge. Here, we introduce geom2vec, in which pretrained graph neural networks (GNNs) are used as universal geometric featurizers. By pretraining equivariant GNNs on a large dataset of molecular conformations with a self-supervised denoising objective, we obtain transferable structural representations that are useful for learning conformational dynamics without further fine-tuning. We show how the learned GNN representations can capture interpretable relationships between structural units (tokens) by combining them with expressive token mixers. Importantly, decoupling training the GNNs from training for downstream tasks enables analysis of larger molecular graphs (such as small proteins at all-atom resolution) with limited computational resources. In these ways, geom2vec eliminates the need for manual feature selection and increases the robustness of simulation analyses.
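
The self-supervised denoising objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common form of coordinate denoising (Gaussian noise is added to atomic positions and a model is scored on how well it predicts that noise); the abstract does not spell out the exact loss, so the function name `denoising_loss`, the callable `predict_noise`, and the noise scale `sigma` are all hypothetical stand-ins rather than the authors' implementation.

```python
import random


def denoising_loss(coords, predict_noise, sigma=0.1, rng=None):
    """Mean squared error between injected noise and a model's prediction.

    coords:        list of [x, y, z] positions, one per atom.
    predict_noise: hypothetical stand-in for the GNN encoder plus noise head;
                   it maps noisy coordinates to a predicted noise array of
                   the same shape.
    sigma:         assumed standard deviation of the Gaussian perturbation.
    """
    if rng is None:
        rng = random.Random(0)
    # Perturb every coordinate with independent Gaussian noise.
    noise = [[rng.gauss(0.0, sigma) for _ in atom] for atom in coords]
    noisy = [[x + e for x, e in zip(atom, eps)]
             for atom, eps in zip(coords, noise)]
    predicted = predict_noise(noisy)
    # Average the squared error over all coordinate components.
    n_terms = sum(len(atom) for atom in coords)
    sq_err = sum((p - e) ** 2
                 for p_atom, e_atom in zip(predicted, noise)
                 for p, e in zip(p_atom, e_atom))
    return sq_err / n_terms
```

In a setup like this, the encoder weights would be fitted once by minimizing the loss over many molecules and then frozen, which is what allows the downstream task heads to be trained separately on precomputed representations.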

Paper Structure

This paper contains 33 sections, 11 equations, 23 figures, 3 tables, 3 algorithms.

Figures (23)

  • Figure 1: The geom2vec workflow. (a) A GNN encoder is pretrained using a denoising objective on a dataset of structures of diverse molecules. (b) Geometric representations for configurations from molecular simulations are obtained by performing inference with the pretrained GNN encoder. (c) The representations are used as inputs to a downstream task head (here, a VAMPnet or SPIB), which is trained separately.
  • Figure 2: VAMPnets with various geom2vec architectures. The amount of training data was varied by dividing the training data into 20 trajectory segments of equal length and then randomly selecting the indicated fraction for training. The validation set is held fixed as the second half of each trajectory. Error bars show standard errors over three independent runs.
  • Figure 3: Potential of mean force (PMF) of chignolin as a function of the first two CVs learned by a VAMPnet trained with SubMixer. Contours are drawn every 1 kcal/mol. See Figures \ref{fig:vamp_chig_cvs}, \ref{fig:vamp_trpcage_cvs}, and \ref{fig:vamp_villin_cvs} for corresponding plots for other architectures and proteins.
  • Figure 4: Chignolin VAMPnet (with SubMixer) CVs as a function of two physical coordinates: the fraction of native contacts and the $\chi_1$ side chain dihedral angle of Thr6 (left) or Thr8 (right). $\bar{Q}$ is the fraction of native contacts smoothed with a 1-ns moving window centered on each time point. We define a native contact as a pair of residues that are three or more positions apart in sequence and have at least one distance between non-hydrogen atoms that is less than 4.5 $\text{\AA}$ in the crystal structure (5AWL \cite{honda2008crystal}). See Figures \ref{fig:vamp_trpcage_physical} and \ref{fig:vamp_villin_physical} for analogous plots for trp-cage and villin.
  • Figure 5: SPIB for trp-cage. All results are obtained from a GNN with SubFormer-GVP token mixer. (top left) PMF as a function of the first two information bottleneck coordinates (IBs). Contours are drawn every 1 kcal/mol. (top right) Same contours colored by SPIB assigned labels. (bottom) Learned Markov State Model. The highlighted structures are chosen randomly from the trajectory. The N-terminus is violet and the C-terminus is red.
  • ...and 18 more figures
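
The native-contact definition in the Figure 4 caption can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, each residue is assumed to be given as a list of non-hydrogen atom positions in Å, a contact is assumed to count as formed in a frame when the same 4.5 Å criterion is met (the caption only defines contacts in the crystal structure), and the moving window is given in frames and truncated at the trajectory ends.

```python
import math


def min_heavy_atom_distance(res_a, res_b):
    """Smallest distance between any two non-hydrogen atoms of two residues."""
    return min(math.dist(p, q) for p in res_a for q in res_b)


def native_contacts(reference, min_separation=3, cutoff=4.5):
    """Residue pairs (i, j) at least `min_separation` apart in sequence whose
    closest heavy atoms lie within `cutoff` in the reference structure."""
    n = len(reference)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if min_heavy_atom_distance(reference[i], reference[j]) < cutoff]


def fraction_native(frame, contacts, cutoff=4.5):
    """Fraction of native contacts formed in one frame (assumed criterion:
    the same distance cutoff as in the reference structure)."""
    if not contacts:
        raise ValueError("no native contacts defined")
    formed = sum(min_heavy_atom_distance(frame[i], frame[j]) < cutoff
                 for i, j in contacts)
    return formed / len(contacts)


def moving_average(values, window):
    """Centered moving average over `window` frames, truncated at the ends."""
    half = window // 2
    smoothed = []
    for t in range(len(values)):
        lo, hi = max(0, t - half), min(len(values), t + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

To approximate the smoothed quantity $\bar{Q}$, one would compute `fraction_native` for every frame and pass the resulting series to `moving_average`, with `window` set to the number of frames that spans 1 ns at the trajectory's saving interval.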