Protein Language Model Embeddings Improve Generalization of Implicit Transfer Operators

Panagiotis Antoniadis, Beatrice Pavesi, Simon Olsson, Ole Winther

TL;DR

This work tackles the expensive sampling problem in molecular dynamics by learning transferable, coarse-grained implicit transfer operators (TITO) for long-timescale dynamics. It introduces PLaTITO, which conditions a flow-matching-based surrogate on multiple sources of auxiliary information, including sequence embeddings from protein language models, structure embeddings, and LLM-derived annotations, with a dedicated large-capacity variant PLaTITO-Big. The results show state-of-the-art equilibrium sampling for out-of-distribution proteins, improved data efficiency, and the emergence of non-Arrhenius temperature dependence in folding/unfolding rates, demonstrating physically meaningful kinetics. Together, these findings highlight the potential of leveraging pre-trained biological representations to accelerate MD surrogate modeling and enable scalable exploration of protein dynamics at reduced computational cost.

Abstract

Molecular dynamics (MD) is a central computational tool in physics, chemistry, and biology, enabling quantitative prediction of experimental observables as expectations over high-dimensional molecular distributions such as Boltzmann distributions and transition densities. However, conventional MD is fundamentally limited by the high computational cost required to generate independent samples. Generative molecular dynamics (GenMD) has recently emerged as an alternative, learning surrogates of molecular distributions either from data or through interaction with energy models. While these methods enable efficient sampling, their transferability across molecular systems is often limited. In this work, we show that incorporating auxiliary sources of information can improve the data efficiency and generalization of transferable implicit transfer operators (TITO) for molecular dynamics. We find that coarse-grained TITO models are substantially more data-efficient than Boltzmann Emulators, and that incorporating protein language model (pLM) embeddings further improves out-of-distribution generalization. Our approach, PLaTITO, achieves state-of-the-art performance on equilibrium sampling benchmarks for out-of-distribution protein systems, including fast-folding proteins. We further study the impact of additional conditioning signals -- such as structural embeddings, temperature, and large-language-model-derived embeddings -- on model performance.

Paper Structure (39 sections, 5 equations, 12 figures, 11 tables, 3 algorithms)

Figures (12)

  • Figure 1: PLaTITO generalizes to unseen protein systems while improving data efficiency. Given the molecular state of a protein system at physical time $t$, defined by backbone coordinates $x_t$, amino-acid sequence $S$, and temperature $T$, our proposed TITO models approximate the long-time transition density $p(x_{t+\Delta t} \mid x_t, S, T, \Delta t)$ for a given time step $\Delta t$. To improve data efficiency, auxiliary representations are incorporated during training, including pretrained sequence embeddings from ESM, pretrained structure embeddings from Proteina, and LLM-derived annotations $A_{LLM}$. Iteratively sampling from the learned transition model generates protein conformational dynamics at increasing timescales, approaching the equilibrium distribution of the MD.
  • Figure 2: Test-time predictions of free energy landscapes of three fast-folders. Free energy surfaces projected into the two slowest TICA components of Villin, WW domain and A3D. PLaTITO-Big (middle) accurately reproduces the MD reference distributions (left) and exceeds the performance of BioEmu (right). Squares ($\square$) and triangles ($\triangle$) denote folded and unfolded states, respectively, with PLaTITO-Big trajectories initialized from the unfolded state. Results for all fast-folding proteins are shown in Appendix \ref{supplementary:equilibrium_sampling}.
  • Figure 3: Scalability of PLaTITO-Big with training compute. Equilibrium sampling metrics improve as training compute increases, indicating effective scaling behavior. The red dashed line corresponds to the performance of BioEmu. Notably, PLaTITO-Big converges within approximately 1,100 GPU hours, which is substantially less than the training cost required by BioEmu (9,216 GPU hours), highlighting the computational efficiency of TITO models compared to Boltzmann Emulators.
  • Figure 4: Top: Free-energy surfaces projected into the two slowest TICA components of A3D, estimated by PLaTITO-Big from 1,000 independent trajectories initialized from either unfolded (top row) or folded (middle row) conformations. Distributions are shown at increasing rollout times (left to right) and compared to the MD reference distribution (rightmost column). Bottom: Time trace of a long 120 µs PLaTITO-Big trajectory projected onto the slowest TICA component, illustrating repeated folding and unfolding events. Results for all fast-folding proteins are shown in Appendix \ref{supplementary:time_trace_all}.
  • Figure 5: PLaTITO-Big recovers non-Arrhenius folding and unfolding rates. Folding (left) and unfolding (right) timescales predicted by PLaTITO-Big are shown as a function of inverse temperature for BBA (top) and Villin (bottom). The predicted rates deviate clearly from simple Arrhenius behavior, indicating that the learned temperature-conditioned dynamics capture physically meaningful kinetic trends. Reference rates estimated from MD simulations are shown as red squares.
  • ...and 7 more figures
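
The iterative sampling described in the Figure 1 caption, and the free energy estimates in Figures 2 and 4, can be illustrated with a minimal sketch. This is not the authors' implementation. The learned model $p(x_{t+\Delta t} \mid x_t, S, T, \Delta t)$ is replaced by a hypothetical stand-in, `toy_transition_sampler`, which draws from the exact transition density of a one-dimensional Ornstein-Uhlenbeck process. The function names `rollout` and `free_energy_profile`, and all parameter values, are assumptions made for illustration only.

```python
import numpy as np


def toy_transition_sampler(x_t, delta_t, temperature, rng):
    """Hypothetical stand-in for a learned transition model.

    Samples x_{t+delta_t} given x_t from the exact transition density of
    the 1D Ornstein-Uhlenbeck process dx = -x dt + sqrt(2 T) dW, whose
    equilibrium distribution is a Gaussian with variance T.
    """
    decay = np.exp(-delta_t)
    mean = x_t * decay
    var = temperature * (1.0 - decay**2)
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)


def rollout(sampler, x0, n_steps, delta_t, temperature, seed=0):
    """Apply the transition sampler repeatedly, feeding each sample back in.

    Returns an array of shape (n_steps + 1, *x0.shape) holding the initial
    state followed by every sampled state.
    """
    rng = np.random.default_rng(seed)
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        traj.append(sampler(traj[-1], delta_t, temperature, rng))
    return np.stack(traj)


def free_energy_profile(samples, bins=30):
    """Estimate F(x) = -ln p(x) (in units of k_B T) from a histogram.

    Empty bins are dropped, and the profile is shifted so its minimum is 0.
    """
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = hist > 0
    f = -np.log(hist[mask])
    return centers[mask], f - f.min()


if __name__ == "__main__":
    # 2,000 independent trajectories, all started far from equilibrium.
    x0 = np.full(2000, 5.0)
    traj = rollout(toy_transition_sampler, x0, n_steps=50,
                   delta_t=0.5, temperature=1.0)
    centers, f = free_energy_profile(traj[-1])
    print("final mean:", traj[-1].mean(), "final variance:", traj[-1].var())
```

In this toy setting the ensemble relaxes from its initial condition toward the equilibrium Gaussian as the rollout time grows, which mirrors the left-to-right progression described for Figure 4. A real model would additionally condition on the sequence $S$ and the auxiliary embeddings, which the stand-in omits.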