Improved motif-scaffolding with SE(3) flow matching

Jason Yim, Andrew Campbell, Emile Mathieu, Andrew Y. K. Foong, Michael Gastegger, José Jiménez-Luna, Sarah Lewis, Victor Garcia Satorras, Bastiaan S. Veeling, Frank Noé, Regina Barzilay, Tommi S. Jaakkola

TL;DR

This work extends the SE(3) flow matching framework FrameFlow to motif-scaffolding through two complementary strategies: motif amortization, which trains a motif-conditioned scaffold generator, and motif guidance, which repurposes an unconditional model with motif-driven trajectory guidance. On a 24-motif benchmark, the approach achieves substantially higher scaffold diversity and up to 2.5x better designability and uniqueness compared with state-of-the-art diffusion-based methods, while remaining faster to sample. The authors also introduce a data-augmentation scheme to simulate motif–scaffold pairings from unlabeled PDB data, enabling robust generalization to novel motifs. Overall, the method demonstrates that diversity-aware motif-scaffolding is feasible with SE(3) flow matching and provides practical pathways for more reliable wet-lab validation. The work positions FrameFlow as a lighter, faster alternative to heavier diffusion models with competitive designability and notably improved scaffold diversity, paving the way for broader motif-based protein design tasks, including potential extension to binders and enzymes.

Abstract

Protein design often begins with knowledge of a desired function in the form of a motif, around which motif-scaffolding aims to construct a functional protein. Recently, generative models have achieved breakthrough success in designing scaffolds for a range of motifs. However, generated scaffolds tend to lack structural diversity, which can hinder success in wet-lab validation. In this work, we extend FrameFlow, an SE(3) flow matching model for protein backbone generation, to perform motif-scaffolding with two complementary approaches. The first is motif amortization, in which FrameFlow is trained with the motif as input using a data augmentation strategy. The second is motif guidance, which performs scaffolding using an estimate of the conditional score from FrameFlow without additional training. On a benchmark of 24 biologically meaningful motifs, we show our method achieves 2.5 times more designable and unique motif-scaffolds compared to the state of the art. Code: https://github.com/microsoft/protein-frame-flow

Paper Structure (34 sections, 30 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: We present two strategies for motif-scaffolding. Top: motif amortization trains a flow model to condition on the motif (blue) and generate the scaffold (red). During training, only the scaffold is corrupted with noise. Bottom: motif guidance re-purposes a flow model that is trained to generate the full protein for motif-scaffolding. During generation, the motif residues are guided to reconstruct the true motif at $t=1$ while the flow model will adjust the scaffold trajectory to be consistent with the motif.
  • Figure 2: Motif data augmentation. Each protein in the dataset does not come with pre-defined motif-scaffold annotations. Instead, we construct plausible motifs at random to simulate sampling from the distribution of motifs and scaffolds.
  • Figure 3: Motif-scaffolding results. Top plot: RFdiffusion achieves the most designable scaffolds among all methods in 9/24 test motifs, compared to FrameFlow-amortization’s 7/24 and TDS’ 6/24; 2/24 are ties. Bottom plot: However, RFdiffusion produces the highest number of unique designable scaffolds for only 2 of the 24 test motifs. Previous approaches that measure only designability (top plot) may therefore be misleading, since the models with the best designability may be repeatedly sampling similar scaffolds. This demonstrates the need to measure diversity alongside designability and to use the number of unique designable scaffolds as the metric of success.
  • Figure 4: FrameFlow-amortization diversity. In blue is the motif while red is the scaffold. For each motif (1QJG, 1YCR, 5TPN), we show FrameFlow-amortization can generate scaffolds of different lengths and various secondary structure elements for the same motif. Each scaffold is in a unique cluster to showcase the samples' structural diversity.
  • Figure 5: Secondary structure analysis. 2D kernel density plots of secondary structure composition of designable motif-scaffolds from FrameFlow-amortization and RFdiffusion. Here we see RFdiffusion tends to generate mostly helical scaffolds, while FrameFlow-amortization produces many more scaffolds with strands.
  • ...and 6 more figures
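
The Figure 3 caption argues for counting unique designable scaffolds rather than designable scaffolds alone. A minimal sketch of how such a count could be computed is given below. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `count_unique_designable`, the pairwise similarity matrix (imagined here as a TM-score-like value in [0, 1]), the 0.5 threshold, and the greedy leader clustering are all hypothetical choices, since this section does not specify the designability criterion or the clustering procedure.

```python
from typing import Sequence


def count_unique_designable(
    designable: Sequence[bool],
    similarity: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff; not taken from the paper
) -> int:
    """Count structural clusters among designable samples.

    Greedy leader clustering (an assumed stand-in for the real procedure):
    a designable sample starts a new cluster only if its similarity to
    every existing cluster leader is below `threshold`.
    """
    leaders = []  # indices of samples that founded a cluster
    for i, is_designable in enumerate(designable):
        if not is_designable:
            continue  # non-designable samples never count
        if all(similarity[i][j] < threshold for j in leaders):
            leaders.append(i)
    return len(leaders)


# Toy example: samples 0 and 1 are near-duplicates, sample 2 fails
# designability, and sample 3 is structurally distinct.
designable_flags = [True, True, False, True]
pairwise_similarity = [
    [1.0, 0.9, 0.8, 0.2],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.2, 0.2, 0.1, 1.0],
]
n_designable = sum(designable_flags)
n_unique = count_unique_designable(designable_flags, pairwise_similarity)
```

In this toy input three samples are designable but only two clusters are found, which is the kind of gap between the top and bottom plots that the caption describes.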