Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures

Ce Liu, Jun Wang, Zhiqiang Cai, Yingxu Wang, Huizhen Kuang, Kaihui Cheng, Liwei Zhang, Qingkun Su, Yining Tang, Fenglei Cao, Limei Han, Siyu Zhu, Yuan Qi

TL;DR

This work introduces Dynamic PDB, a large-scale dynamic protein dataset built from all-atom MD simulations of 12,643 proteins at $1\,\mathrm{ps}$ intervals for $1\,\mathrm{\mu s}$ each, including velocities, forces, energies, and environment temperature. It extends SE(3) diffusion models by conditioning on amino-acid sequences and rich physical properties, enabling more accurate trajectory prediction. Through experiments on two target proteins, the physics-conditioned model consistently lowers MAE and RMSD compared with several baselines and demonstrates improved generalization with larger training sets. The dataset and physics-informed diffusion framework advance the study of protein dynamics and provide a foundation for more realistic, dynamically aware structure predictions.

Abstract

Despite significant progress in static protein structure collection and prediction, the dynamic behavior of proteins, one of their most vital characteristics, has been largely overlooked in prior research. This oversight can be attributed to the limited availability, diversity, and heterogeneity of dynamic protein datasets. To address this gap, we propose to enhance established static 3D protein structure databases, such as the Protein Data Bank (PDB), by integrating dynamic data and additional physical properties. Specifically, we introduce a large-scale dataset, Dynamic PDB, encompassing approximately 12.6K proteins, each subjected to all-atom molecular dynamics (MD) simulations lasting 1 microsecond to capture conformational changes. Furthermore, we provide a comprehensive suite of physical properties, including atomic velocities and forces, potential and kinetic energies of proteins, and the temperature of the simulation environment, recorded at 1 picosecond intervals throughout the simulations. For benchmarking purposes, we evaluate state-of-the-art methods on the proposed dataset for the task of trajectory prediction. To demonstrate the value of integrating richer physical properties in the study of protein dynamics and related model design, we base our approach on the SE(3) diffusion model and incorporate these physical properties into the trajectory prediction process. Preliminary results indicate that this straightforward extension of the SE(3) model yields improved accuracy, as measured by MAE and RMSD, when the proposed physical properties are taken into consideration. Project page: https://fudan-generative-vision.github.io/dynamicPDB/
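
The abstract reports accuracy in terms of MAE and RMSD but does not define them here. As a rough illustration only, the sketch below computes both between a predicted and a reference set of atomic coordinates for a single frame. It assumes no structural superposition is applied and that MAE is averaged over all x, y, z components; the paper's actual protocol (alignment, atom selection, averaging over frames) may differ. The function names `rmsd` and `mae` are hypothetical and not taken from the authors' code.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation over atoms (assumption: no superposition)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and the same length")
    # Sum of squared coordinate differences over all atoms and axes.
    sq = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(sq / len(pred))

def mae(pred, ref):
    """Mean absolute error averaged over every x, y, z component (assumption)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and the same length")
    total = sum(abs(p - r) for a, b in zip(pred, ref) for p, r in zip(a, b))
    return total / (3 * len(pred))

# Toy two-atom frame: the second atom is displaced by 2 units along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(round(rmsd(predicted, reference), 4))  # 1.4142
print(round(mae(predicted, reference), 4))   # 0.3333
```

For a full trajectory, one plausible choice is to compute these per frame and average over the predicted frames; whether the paper does so is not stated in this section.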

Paper Structure (47 sections, 16 figures, 7 tables)

Figures (16)

  • Figure 1: The conformational evolution and statistics of protein 3TVJ_I from the proposed dataset. a) The regions with the most significant changes in the RMSD (Root Mean Square Deviation) and radius of gyration curves over time correspond to potential conformational changes, as depicted in the upper part of the figure. b) The contact map frequency illustrates the changes in interactions between residues within the protein. c) The Ramachandran plot provides insight into the dihedral angles of the protein backbone, indicating the structural validity of the protein conformation.
  • Figure 2: RMSF comparison between the proposed dataset and ATLAS reveals similar residue fluctuations, effectively capturing the intrinsic dynamics of proteins.
  • Figure 3: RMSD plots for our dataset and ATLAS. Longer simulation time can potentially capture more protein conformational changes, which are indicated by the red arrows.
  • Figure 4: Visualization of our protein trajectories. Storing them at a higher temporal resolution offers a more detailed depiction of each protein's trajectory.
  • Figure 5: Overall architecture of our network. We first extract features with an amino acid encoder and a physical properties encoder, respectively. We then refine the node features with IPA and concatenate them with the physical condition embedding. After a 2D convolution operation, we predict the updated node features, torsion angles, and transformations.
  • ...and 11 more figures