Learning conformational ensembles of proteins based on backbone geometry

Nicolas Wolf, Leif Seute, Vsevolod Viliuga, Simon Wagner, Jan Stühmer, Frauke Gräter

TL;DR

This work tackles efficient sampling of protein conformational ensembles from Boltzmann distributions without relying on evolutionary information. It introduces BBFlow, a conditional SE(3) flow-matching model that encodes the equilibrium backbone geometry and uses a geodesic-based conditional prior to generate samples from p(x|x_eq). BBFlow achieves orders-of-magnitude faster inference than state-of-the-art MD emulators, generalizes to multi-chain proteins, and performs well on both natural and de novo proteins while being trainable from scratch in a few GPU days. Limitations include dependence on an initial structure, restricted ability to capture rare, long-timescale events, and incomplete modeling of sidechains, but the approach offers a practical, scalable path for MD-accurate dynamics in design pipelines and large-scale screenings.

Abstract

Deep generative models have recently been proposed for sampling protein conformations from the Boltzmann distribution, as an alternative to often prohibitively expensive Molecular Dynamics simulations. However, current state-of-the-art approaches rely on fine-tuning pre-trained folding models and on evolutionary sequence information, which limits their applicability and efficiency and introduces potential biases. In this work, we propose BBFlow, a flow matching model for sampling protein conformations based solely on backbone geometry. We introduce a geometric encoding of the backbone equilibrium structure as input and propose to condition not only the flow but also the prior distribution on the respective equilibrium structure, eliminating the need for evolutionary information. The resulting model is orders of magnitude faster than current state-of-the-art approaches at comparable accuracy, is transferable to multi-chain proteins, and can be trained from scratch in a few GPU days. In our experiments, we demonstrate that the proposed model achieves competitive performance with reduced inference time, not only on an established benchmark of naturally occurring proteins but also on de novo proteins, for which evolutionary information is scarce or absent. BBFlow is available at https://github.com/graeter-group/bbflow.
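To make the idea of conditioning the prior on the equilibrium structure concrete, the following is a minimal sketch of a partial geodesic interpolation between noise and the equilibrium backbone frames. It is not taken from the BBFlow code base: the function names, the interpolation fraction `gamma`, the `noise_scale` parameter, and the representation of residue frames as translations plus unit quaternions are all assumptions made for illustration. Translations are interpolated linearly and rotations along the SO(3) geodesic (spherical linear interpolation of quaternions).

```python
import numpy as np


def random_unit_quaternions(n, rng):
    """Uniformly distributed rotations as unit quaternions (normalized Gaussians)."""
    q = rng.normal(size=(n, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)


def slerp(q0, q1, t):
    """Row-wise geodesic interpolation from q0 (t=0) to q1 (t=1)."""
    dot = np.sum(q0 * q1, axis=1, keepdims=True)
    q1 = np.where(dot < 0, -q1, q1)  # q and -q encode the same rotation: take the short arc
    omega = np.arccos(np.clip(np.abs(dot), 0.0, 1.0))  # half the relative rotation angle
    sin_omega = np.sin(omega)
    small = sin_omega < 1e-8  # nearly identical rotations: fall back to linear weights
    safe = np.where(small, 1.0, sin_omega)
    w0 = np.where(small, 1.0 - t, np.sin((1.0 - t) * omega) / safe)
    w1 = np.where(small, t, np.sin(t * omega) / safe)
    q = w0 * q0 + w1 * q1
    return q / np.linalg.norm(q, axis=1, keepdims=True)


def rotation_angle(q0, q1):
    """Geodesic distance on SO(3) (radians) between rotations given as quaternions."""
    dot = np.clip(np.abs(np.sum(q0 * q1, axis=1)), 0.0, 1.0)
    return 2.0 * np.arccos(dot)


def sample_conditional_prior(trans_eq, quat_eq, gamma, rng, noise_scale=1.0):
    """Draw one prior sample x_0 per residue, pulled a fraction gamma towards x_eq.

    gamma = 0 gives pure noise, gamma = 1 reproduces the equilibrium frames.
    Both gamma and noise_scale are hypothetical parameters of this sketch.
    """
    n = trans_eq.shape[0]
    trans_noise = noise_scale * rng.normal(size=(n, 3))
    quat_noise = random_unit_quaternions(n, rng)
    trans_0 = (1.0 - gamma) * trans_noise + gamma * trans_eq  # geodesic in R^3
    quat_0 = slerp(quat_noise, quat_eq, gamma)                # geodesic in SO(3)
    return trans_0, quat_0


# Usage with a made-up 10-residue "equilibrium structure".
rng = np.random.default_rng(0)
trans_eq = rng.normal(size=(10, 3))
quat_eq = random_unit_quaternions(10, rng)
trans_0, quat_0 = sample_conditional_prior(trans_eq, quat_eq, gamma=0.2, rng=rng)
```

With a partial interpolation (0 < `gamma` < 1), each prior sample already carries coarse information about the input protein, so the learned flow only has to bridge the remaining distance to the conformational ensemble. The value of `gamma` and the noise distribution used by the authors are not specified in this summary.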

Paper Structure

This paper contains 59 sections, 14 equations, 12 figures, 15 tables, 1 algorithm.

Figures (12)

  • Figure 1: Schematic representation of BBFlow. The equilibrium backbone structure $x_\text{eq}$ of an input protein is used to condition an SE(3) Flow Matching model on the generation of protein backbone conformations $x_{1}$. The prior $p_0$ of the flow matching process is itself conditioned on the input protein via partial geodesic interpolation between pure noise and the equilibrium backbone structure.
  • Figure 2: (A) Performance of BBFlow, AlphaFlow-T and AlphaFlow-T$_\text{12L,dist}$ on the ATLAS test set for different protein lengths. We divide the protein lengths into three bins and, for each protein whose length falls in the respective bin, calculate RMSF MAE, the absolute error of pairwise RMSD, and PCA $\mathcal{W}_2$ (lower is better). The boxes depict the 0.25 and 0.75 quantiles, minimum, maximum, and median over all test proteins. We also show inference time per generated conformation as a function of protein length on a log scale, spanning several orders of magnitude. (B) RMSF profiles of de novo proteins. We show structures and RMSF profiles predicted by BBFlow and MD for four selected proteins from the dataset of de novo proteins, along with Pearson correlation $r$ and MAE as reported in Tab. \ref{tab:de_novo}.
  • Figure 3: Ensembles of two de novo proteins predicted by BBFlow, AlphaFlow-T (AF-T), AlphaFlow (AF) and BioEmu compared with the ground truth molecular dynamics (MD) simulation. The proteins were generated by RFDiffusion and ProteinMPNN, and are colored by residue index.
  • Figure 4: BBFlow is applicable to multi-chain proteins. Dynamic cross-correlation matrices (DCCM) of conformational ensembles computed either with MD (upper triangle) or with BBFlow (lower triangle) for three protein dimers. Chain boundaries are indicated by black lines within the matrices. $r$: Pearson correlation between entries of the triangular matrices. We show RMSF profiles in Fig. \ref{fig:app_rmsf_multimers}.
  • Figure 5: Trade-off between accuracy and speed of MD emulation. While other methods are either efficient or accurate, BBFlow performs well at both. The accuracy metric RMSF MAE and inference time are averaged over the ATLAS test set. More metrics can be found in Fig. \ref{fig:app_full_tradeoff}. BBFlow-light: App. \ref{sec:app_bbflow_light}.
  • ...and 7 more figures
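
The ensemble comparisons named in the captions of Figures 2 and 4 (RMSF profiles with MAE and Pearson $r$, and dynamic cross-correlation matrices) can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: it assumes each ensemble is an array of shape (frames, residues, 3) holding one representative atom per residue (for example C$\alpha$), already superimposed on a common reference, and all function names are hypothetical.

```python
import numpy as np


def rmsf(coords):
    """Per-residue root mean square fluctuation around the ensemble mean."""
    mean = coords.mean(axis=0)
    return np.sqrt(((coords - mean) ** 2).sum(axis=-1).mean(axis=0))


def dccm(coords):
    """Dynamic cross-correlation matrix C_ij = <dr_i . dr_j> / sqrt(<dr_i^2> <dr_j^2>)."""
    disp = coords - coords.mean(axis=0)                     # (frames, residues, 3)
    cov = np.einsum("fid,fjd->ij", disp, disp) / coords.shape[0]
    norm = np.sqrt(np.diag(cov))                            # equals the per-residue RMSF
    return cov / np.outer(norm, norm)


def compare_ensembles(pred, ref):
    """RMSF MAE, RMSF Pearson r, and Pearson r between off-diagonal DCCM entries."""
    rmsf_pred, rmsf_ref = rmsf(pred), rmsf(ref)
    upper = np.triu_indices(pred.shape[1], k=1)             # skip the trivial diagonal
    return {
        "rmsf_mae": float(np.abs(rmsf_pred - rmsf_ref).mean()),
        "rmsf_pearson_r": float(np.corrcoef(rmsf_pred, rmsf_ref)[0, 1]),
        "dccm_pearson_r": float(
            np.corrcoef(dccm(pred)[upper], dccm(ref)[upper])[0, 1]
        ),
    }


# Usage with synthetic ensembles standing in for MD and model output.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 30, 3))
predicted = reference + 0.1 * rng.normal(size=reference.shape)
metrics = compare_ensembles(predicted, reference)
```

The superposition step is deliberately left out; in practice both ensembles would first be aligned to the same reference structure, since RMSF and DCCM are sensitive to global rotation and translation.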