Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation

Guillaume Huguet, James Vuckovic, Kilian Fatras, Eric Thibodeau-Laufer, Pablo Lemos, Riashat Islam, Cheng-Hao Liu, Jarrid Rector-Brooks, Tara Akhound-Sadegh, Michael Bronstein, Alexander Tong, Avishek Joey Bose

TL;DR

This work introduces FoldFlow-2, a sequence-conditioned SE(3)-equivariant flow model for protein backbone generation that fuses structure and sequence information through a multi-modal encoder and a geometric decoder. By leveraging a large pretrained protein language model (ESM2) and Reinforced Fine-Tuning (ReFT), FoldFlow-2 achieves state-of-the-art unconditional designability, novelty, and diversity, while enabling conditional tasks such as folding sequences and motif scaffolding. The approach is evaluated across unconditional generation, motif scaffolding, and zero-shot equilibrium conformation sampling, showing strong performance and generalization, and it demonstrates practical potential for de novo drug design and targeted protein engineering. Limitations include reliance on the quality of the pretrained language model and synthetic data filtering, which may affect downstream evaluation with fold predictors.

Abstract

Proteins are essential for almost all biological processes and derive their diverse functions from complex 3D structures, which are in turn determined by their amino acid sequences. In this paper, we exploit the rich biological inductive bias of amino acid sequences and introduce FoldFlow-2, a novel sequence-conditioned SE(3)-equivariant flow matching model for protein structure generation. FoldFlow-2 presents substantial new architectural features over the previous FoldFlow family of models, including a protein large language model to encode sequence, a new multi-modal fusion trunk that combines structure and sequence representations, and a geometric-transformer-based decoder. To increase the diversity and novelty of generated samples, which is crucial for de novo drug design, we train FoldFlow-2 at scale on a new dataset that is an order of magnitude larger than the PDB datasets of prior works, containing both known proteins in PDB and high-quality synthetic structures obtained through filtering. We further demonstrate the ability to align FoldFlow-2 to arbitrary rewards, e.g., increasing secondary structure diversity, by introducing a Reinforced Finetuning (ReFT) objective. We empirically observe that FoldFlow-2 outperforms previous state-of-the-art protein structure-based generative models, improving over RFDiffusion in unconditional generation on all metrics (designability, diversity, and novelty) at all protein lengths, as well as exhibiting generalization on the task of equilibrium conformation sampling. Finally, we demonstrate that a fine-tuned FoldFlow-2 makes progress on challenging conditional design tasks such as designing scaffolds for the VHH nanobody.

Paper Structure

This paper contains 36 sections, 5 equations, 13 figures, and 18 tables.

Figures (13)

  • Figure 1: FoldFlow-2 architecture, which processes sequence and structure and outputs $\mathrm{SE(3)}^{N}_{0}$ vector fields.
  • Figure 2: Uncurated designable (scRMSD $< 2\textup{\AA}$) length-100 structures from FoldFlow-2 and RFDiffusion, shown with their ESMFold refolded structures and colored by secondary structure assignment. FoldFlow-2 is significantly more diverse in secondary structure composition, whereas RFDiffusion generates mostly $\alpha$-helices.
  • Figure 3: Distribution of secondary structure elements ($\alpha$-helices, $\beta$-sheets, and coils) of designable (scRMSD $< 2.0\,\textup{\AA}$) proteins generated by various models. FoldFlow-2 generates more diverse designable backbones.
  • Figure 4: Protein conformation ensembles from the ATLAS dataset, ESMFlow-MD and FoldFlow-2. Proteins are colored by their secondary structure with $\alpha$-helices in blue, $\beta$-sheets in red, and coils in green.
  • Figure 5: Analysis of global pLDDT distribution on a sample of 500 proteins from SwissProt.
  • ...and 8 more figures
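
The captions of Figures 2 and 3 rely on two quantities: a designability filter (scRMSD below 2 Å) and the secondary structure composition of the samples that pass it. The following is a minimal sketch of that bookkeeping, not the paper's evaluation code. It assumes each sample already has a precomputed scRMSD value and a DSSP-style per-residue string; the field names `scrmsd` and `ss`, the toy samples, and the 3-state reduction of DSSP codes (H/G/I as helix, E/B as strand, everything else as coil) are illustrative assumptions.

```python
from collections import Counter

# Common 3-state reduction of DSSP's 8-state codes. This mapping is an
# assumption; this section does not state how the paper assigns states.
HELIX = set("HGI")
STRAND = set("EB")


def ss_fractions(dssp):
    """Return (helix, strand, coil) fractions for one DSSP-style string."""
    if not dssp:
        raise ValueError("empty secondary structure string")
    counts = Counter(
        "helix" if c in HELIX else "strand" if c in STRAND else "coil"
        for c in dssp
    )
    n = len(dssp)
    return counts["helix"] / n, counts["strand"] / n, counts["coil"] / n


def designable(samples, threshold=2.0):
    """Keep samples whose scRMSD (in angstroms) is strictly below threshold."""
    return [s for s in samples if s["scrmsd"] < threshold]


# Hypothetical toy samples; real values would come from a refolding pipeline.
samples = [
    {"scrmsd": 1.2, "ss": "HHHHHHHHCC"},
    {"scrmsd": 3.5, "ss": "EEEEECCCCC"},
    {"scrmsd": 0.8, "ss": "EEEECCHHHH"},
]

kept = designable(samples)
designability = len(kept) / len(samples)  # fraction of samples passing the filter
fractions = [ss_fractions(s["ss"]) for s in kept]
```

Plotting the `fractions` tuples per model would give a composition view in the spirit of Figure 3: a model that produces mostly helical backbones would cluster near a helix fraction of 1.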