ProTDyn: a foundation Protein language model for Thermodynamics and Dynamics generation

Yikai Liu, Haoyang Zheng, Lining Mao, Yanbin Wang, Ming Chen, Guang Lin

TL;DR

ProTDyn tackles the MD bottleneck by unifying thermodynamics and multi-timescale dynamics in a single transformer-based generative model. It tokenizes conformations with a structure tokenizer and is trained on three objectives: equilibrium ensemble generation ($L_{thermo}$), forward dynamics generation ($L_{dyn}$), and dynamics inpainting ($L_{dynI}$), combined as $L_{ProTDyn}=\omega_1 L_{thermo}+\omega_2 L_{dyn}+\omega_3 L_{dynI}$. Empirical results show Boltzmann-consistent ensembles and accurate long-timescale dynamics, with strong generalization to unseen proteins and performance comparable to reference MD while enabling scalable generation. The framework supports exact likelihood evaluation and offers a path toward integrating physics-based energy functions and enforcing principles like detailed balance. Overall, ProTDyn provides a scalable, transferable approach that bridges thermodynamics and dynamics within a single generative model, enabling efficient exploration of protein conformational landscapes across multiple timescales.
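The combined objective is a plain weighted sum, which can be sketched in a few lines. This is only an illustration of the formula $L_{ProTDyn}=\omega_1 L_{thermo}+\omega_2 L_{dyn}+\omega_3 L_{dynI}$: the function name `protdyn_loss`, the default weights, and the example loss values are assumptions, not values reported by the authors.

```python
def protdyn_loss(l_thermo, l_dyn, l_dyn_i, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three training objectives.

    l_thermo : loss of the thermodynamics (ensemble generation) task
    l_dyn    : loss of the forward dynamics task
    l_dyn_i  : loss of the dynamics inpainting task
    weights  : (omega_1, omega_2, omega_3); the defaults are hypothetical
    """
    w1, w2, w3 = weights
    return w1 * l_thermo + w2 * l_dyn + w3 * l_dyn_i


# Hypothetical per-task loss values, for illustration only.
total = protdyn_loss(1.0, 2.0, 3.0, weights=(0.5, 0.25, 0.0))
```

Setting one weight to zero, as in the usage line, switches that task off, which is one simple way to compare single-task and joint training under this formulation.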

Abstract

Molecular dynamics (MD) simulation has long been the principal computational tool for exploring protein conformational landscapes and dynamics, but its application is limited by high computational cost. We present ProTDyn, a foundation protein language model that unifies conformational ensemble generation and multi-timescale dynamics modeling within a single framework. Unlike prior approaches that treat these tasks separately, ProTDyn allows flexible independent and identically distributed (i.i.d.) ensemble sampling and dynamic trajectory simulation. Across diverse protein systems, ProTDyn yields thermodynamically consistent ensembles, faithfully reproduces dynamical properties over multiple timescales, and generalizes to proteins beyond its training data. It offers a scalable and efficient alternative to conventional MD simulations.

Paper Structure

This paper contains 11 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An illustrative framework of ProTDyn. ProTDyn is a protein language model that operates on discretized representations of protein sequence and structure. It leverages a powerful autoregressive transformer architecture to simultaneously perform three tasks: (i) equilibrium conformational ensemble generation (thermodynamics), (ii) forward trajectory generation across multiple timescales (dynamics), and (iii) recovery of fine-grained trajectories from coarse trajectories (dynamics inpainting).
  • Figure 2: Free energy surface along the top two TICA components, parameterized from the backbone torsion angles of reference MD simulations. The TICA projection is then applied to conformational ensembles generated by the three sampling modules of ProTDyn: (1) "Thermodynamics", (2) "Dynamics (100 ns)", and (3) "Dynamics (10 ns)", as well as to ensembles generated by the baseline model BioEmu.
  • Figure 3: Representative conformational metastable states and dynamic transition pathways illustrated on the 2D TICA free energy surface (FES) for a CATH1 protein system: 1b43A02.
  • Figure 4: Autocorrelation of the top two TICA components from 0 to 800 ns lag time, evaluated on four test CATH1 proteins using reference MD trajectories and dynamic trajectories generated by the two dynamic sampling modules of ProTDyn.
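The two analyses named in the captions of Figures 2 and 4 (a free energy surface over two TICA components and the autocorrelation of a TICA component) can be sketched with NumPy as follows. This is a minimal sketch, not the authors' evaluation code: it assumes the trajectories have already been projected onto TICA components, and the function names, bin count, temperature of 300 K, and kcal/mol units are all assumptions.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def free_energy_surface(x, y, bins=50, temperature=300.0):
    """Estimate F = -k_B T ln p on a 2D grid from projected samples.

    x, y are 1D arrays of the first and second projected coordinates
    (assumed here to be TICA components). Empty bins get +inf.
    """
    hist, xedges, yedges = np.histogram2d(x, y, bins=bins, density=True)
    with np.errstate(divide="ignore"):
        fes = -K_B * temperature * np.log(hist)
    fes -= fes[np.isfinite(fes)].min()  # shift so the global minimum is 0
    return fes, xedges, yedges


def autocorrelation(series, max_lag):
    """Normalized autocorrelation of a 1D time series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([
        np.dot(x[: n - lag], x[lag:]) / ((n - lag) * var)
        for lag in range(max_lag + 1)
    ])


# Synthetic data standing in for projected coordinates (illustration only).
rng = np.random.default_rng(0)
tic1, tic2 = rng.normal(size=1000), rng.normal(size=1000)
fes, _, _ = free_energy_surface(tic1, tic2, bins=10)
acf = autocorrelation(np.sin(np.linspace(0.0, 20.0, 500)), max_lag=10)
```

Lags are in units of trajectory frames; converting to nanoseconds, as in Figure 4, requires multiplying by the frame spacing of the trajectory being analyzed.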