Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning

Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, He Zhang, Shidi Tang, Hongxia Hao, Peiran Jin, Chi Chen, Frank Noé, Haiguang Liu, Tie-Yan Liu

TL;DR

This work addresses the challenge of predicting equilibrium distributions of molecular systems rather than a single static structure. It introduces Distributional Graphormer (DiG), a diffusion-based framework with a Graphormer backbone that learns reverse diffusion conditioned on molecular descriptors to generate diverse, thermodynamically plausible conformations and estimate state densities. DiG can be trained with data (MD/experimental) or physics-informed diffusion pre-training using energy functions, enabling step-by-step supervision and density computation. The authors demonstrate DiG on protein conformations, ligand poses, catalyst-adsorbate distributions, and carbon polymorph design, achieving MD-like coverage with substantial speedups and enabling inverse design via property conditioning. These results offer a scalable path to macroscopic thermodynamic insights and design capabilities across chemistry and materials science.

Abstract

Advances in deep learning have greatly improved structure prediction of molecules. However, many macroscopic observations that are important for real-world applications are not functions of a single molecular structure, but rather determined from the equilibrium distribution of structures. Traditional methods for obtaining these distributions, such as molecular dynamics simulation, are computationally expensive and often intractable. In this paper, we introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems. Inspired by the annealing process in thermodynamics, DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system, such as a chemical graph or a protein sequence. This framework enables efficient generation of diverse conformations and provides estimations of state densities. We demonstrate the performance of DiG on several molecular tasks, including protein conformation sampling, ligand structure sampling, catalyst-adsorbate sampling, and property-guided structure generation. DiG presents a significant advancement in methodology for statistically understanding molecular systems, opening up new research opportunities in molecular science.
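
As a rough illustration of the idea of connecting a simple distribution to an equilibrium distribution, the following toy sketch shows only the predefined noising direction: samples from a bimodal stand-in for an equilibrium distribution are gradually turned into a standard Gaussian. The DDPM-style interpolation, the one-dimensional two-state target, and the names `forward_diffuse` and `alpha_bar` are assumptions made for illustration; they are not taken from the paper, and the learned reverse transformation (the part DiG models with Graphormer) is not implemented here.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, rng):
    """Sample x_t given x_0 under a DDPM-style noising process (assumed form).

    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise,
    where alpha_bar = 1 means no noise and alpha_bar = 0 means pure noise.
    """
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)

# Toy stand-in for an equilibrium distribution: two metastable states
# centered at -2 and +2 (hypothetical values, illustration only).
n_samples = 20000
centers = rng.choice([-2.0, 2.0], size=n_samples)
x0 = centers + 0.3 * rng.standard_normal(n_samples)

# Walk from the data distribution (alpha_bar = 1) to the simple one (alpha_bar = 0).
variances = {}
for alpha_bar in (1.0, 0.5, 0.0):
    xt = forward_diffuse(x0, alpha_bar, rng)
    variances[alpha_bar] = xt.var()
    print(f"alpha_bar={alpha_bar:.1f}  mean={xt.mean():+.3f}  var={xt.var():.3f}")
```

At `alpha_bar = 1.0` the samples are unchanged and still bimodal; at `alpha_bar = 0.0` they are indistinguishable from a standard Gaussian. A generative model in this family is trained to run the process in the opposite direction.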

Paper Structure

This paper contains 54 sections, 56 equations, 13 figures, 6 tables, and 11 algorithms.

Figures (13)

  • Figure 1: Predicting conformational distributions with the Distributional Graphormer (DiG) framework. (a) DiG takes the basic descriptor $\mathcal{D}$ of a target molecular system as input, e.g., an amino acid sequence, and generates a probability distribution of structures that aims to approximate the equilibrium distribution and to sample different metastable or intermediate states. In contrast, static structure prediction methods, such as AlphaFold (Jumper et al., 2021), aim to predict a single high-probability structure of a molecule. (b) The DiG framework for predicting distributions of molecular structures. A deep-learning model (Graphormer; Ying et al., 2021) is used as a module to predict a diffusion process ($\rightarrow$) that gradually transforms a simple distribution towards the target distribution. The model is trained so that the derived distribution $p_i$ at each intermediate diffusion time step $i$ matches the corresponding distribution $q_i$ in a predefined diffusion process ($\leftarrow$) that transforms the equilibrium distribution into the simple distribution. Supervision can be obtained from both samples (lower row) and a molecular energy function (upper row).
  • Figure 2: Distribution and sampling results for protein conformations.
  • Figure 3: (a) Structures generated by DiG resemble the diverse conformations of millisecond MD simulations. MD-simulated structures are projected onto the reduced 2D space spanned by TICA coordinates, and the probability densities are depicted using contour lines. For the RBD protein, MD simulation reveals four highly populated regions in this 2D space (left panel). Structures generated by DiG are mapped to the same space and shown as orange dots, with their distribution reflected by the color intensity. Below the distribution map, structures generated by DiG (thin ribbons) are superposed on representative structures of the four clusters. AlphaFold-predicted structures ($\star$) are also shown in the plot. The right panel shows the results for the main protease of SARS-CoV-2, compared with MD simulations and AlphaFold predictions. The contour map reveals three clusters: DiG generates highly similar structures in clusters II and III, while structures in cluster I are accurately generated. (b) The performance of DiG in generating multiple conformations of proteins (each structure is labeled by its PDB ID, except DEER-AF, which is an AlphaFold-predicted model consistent with experimental observations). Structures generated by DiG (thin ribbons) are compared with the experimentally determined structures (cylindrical cartoons) in each case. For the four proteins (adenylate kinase, Lmrb membrane protein, human B-Raf kinase, and D-ribose binding protein), structures in two functional states (distinguished by cyan and brown) are well reproduced by DiG (ribbons).
  • Figure 4: Results of DiG for ligand structure sampling around protein pockets. (a) Results of DiG on poses of ligands bound to protein pockets. DiG generates ligand structures and binding poses with good accuracy compared to the crystal structures (reflected by the RMSD statistics, shown in the red histogram for the best-matching cases and in the green histogram for the median RMSD). When all 50 predicted binding poses for each system are considered, diversity is observed, as reflected in the RMSD histogram (yellow, normalized) relative to the references. (b) Representative systems show that the diversity in ligand binding poses is related to the properties of the binding pocket. For a deep and narrow binding pocket, such as that of the Tyk2 protein (shown in surface representation, top panel), DiG predicts highly similar binding poses for the ligand (in atom-bond representation, top panel). For the P38 protein, the binding pocket is relatively flat and shallow, and the predicted ligand poses are highly diverse with large conformational flexibility (bottom panel, in the same representations as in the Tyk2 case).
  • Figure 5: Results of DiG for catalyst-adsorbate sampling problems. (a) The problem setting: prediction of the adsorption configuration distribution of an adsorbate on a catalyst surface. (b) The adsorption sites and corresponding configurations of the adsorbate found by DiG (in color), compared with DFT results (in white). DiG finds all the adsorption sites, with adsorbate structures close to the DFT baseline. For all adsorption sites and configurations, refer to Appendix E. (c-f) Adsorption prediction results for single N and O atoms on catalyst surfaces, compared to DFT calculations. The top panels show the catalyst surface; the middle panels show the probability distributions of adsorbate molecules on the corresponding catalyst surfaces on a log scale; the bottom panels show the interactions between the adsorbate molecule and the catalyst calculated using DFT methods. The adsorption sites and predicted probabilities are highly consistent with the energy landscape obtained by DFT computations.
  • ...and 8 more figures
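
The Figure 5 caption compares log-scale probabilities with a DFT energy landscape. One way to see why those two quantities can be compared is the Boltzmann relation for an equilibrium distribution, $p_i \propto \exp(-E_i / k_B T)$, under which log-probabilities are an affine function of the energies. The sketch below is an illustration under that assumption only: the three site energies, the temperature of 300 K, and the function name `boltzmann_probabilities` are hypothetical and do not come from the paper.

```python
import numpy as np

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def boltzmann_probabilities(energies_ev, temperature_k):
    """Convert discrete state energies (eV) to equilibrium probabilities.

    Energies are shifted by their minimum before exponentiation so the
    largest weight is exactly 1, which avoids overflow.
    """
    e = np.asarray(energies_ev, dtype=float)
    weights = np.exp(-(e - e.min()) / (K_B_EV * temperature_k))
    return weights / weights.sum()

# Hypothetical adsorption energies (eV) for three sites; illustration only.
site_energies = [-1.20, -1.15, -0.90]
temperature = 300.0  # assumed temperature in K

p = boltzmann_probabilities(site_energies, temperature)
log_p = np.log(p)

for energy, prob, lp in zip(site_energies, p, log_p):
    print(f"E={energy:+.2f} eV  p={prob:.3e}  log p={lp:+.3f}")
```

Differences in `log_p` between two sites equal the corresponding energy differences divided by $-k_B T$, which is why a log-scale probability map and an energy map are expected to show the same sites as minima and maxima respectively.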