Protein Language Model Embeddings Improve Generalization of Implicit Transfer Operators
Panagiotis Antoniadis, Beatrice Pavesi, Simon Olsson, Ole Winther
TL;DR
This work addresses the high cost of sampling in molecular dynamics by learning transferable, coarse-grained implicit transfer operators (TITO) for long-timescale dynamics. It introduces PLaTITO, which conditions a flow-matching-based surrogate on multiple sources of auxiliary information, including sequence embeddings from protein language models, structure embeddings, and LLM-derived annotations, and it also presents a larger-capacity variant, PLaTITO-Big. The results show state-of-the-art equilibrium sampling for out-of-distribution proteins, improved data efficiency, and the emergence of non-Arrhenius temperature dependence in folding and unfolding rates, a sign of physically meaningful kinetics. Together, these findings highlight the potential of pre-trained biological representations to accelerate MD surrogate modeling and to enable scalable exploration of protein dynamics at reduced computational cost.
Abstract
Molecular dynamics (MD) is a central computational tool in physics, chemistry, and biology, enabling quantitative prediction of experimental observables as expectations over high-dimensional molecular distributions such as Boltzmann distributions and transition densities. However, conventional MD is fundamentally limited by the high computational cost required to generate independent samples. Generative molecular dynamics (GenMD) has recently emerged as an alternative, learning surrogates of molecular distributions either from data or through interaction with energy models. While these methods enable efficient sampling, their transferability across molecular systems is often limited. In this work, we show that incorporating auxiliary sources of information can improve the data efficiency and generalization of transferable implicit transfer operators (TITO) for molecular dynamics. We find that coarse-grained TITO models are substantially more data-efficient than Boltzmann Emulators, and that incorporating protein language model (pLM) embeddings further improves out-of-distribution generalization. Our approach, PLaTITO, achieves state-of-the-art performance on equilibrium sampling benchmarks for out-of-distribution protein systems, including fast-folding proteins. We further study the impact of additional conditioning signals (such as structural embeddings, temperature, and large-language-model-derived embeddings) on model performance.
