Flows, straight but not so fast: Exploring the design space of Rectified Flows in Protein Design

Junhua Chen, Simon Mathis, Charles Harris, Kieran Didi, Pietro Lio

TL;DR

This work tackles the resource bottleneck in de novo protein backbone design using flow-based models on frames in $SE(3)^N$. It extends Rectified Flows (ReFlow) to manifold data and proteins, adapting the coupling generation, training, and inference choices from image-domain practice to the protein domain. The authors demonstrate that ReFlow improves low-NFE designability across data settings but is highly sensitive to coupling generation and inference annealing, with domain-specific discretization and loss configurations delivering large gains. They also show that several image-domain improvements do not translate to proteins and propose guidelines for when to deploy ReFlow versus simpler fine-tuning, highlighting multimodality as a key factor. The findings offer practical routes to faster, designable protein backbone generation at scale.

Abstract

Generative modeling techniques such as Diffusion and Flow Matching have achieved significant successes in generating designable and diverse protein backbones. However, many current models are computationally expensive, requiring hundreds or even thousands of function evaluations (NFEs) to yield samples of acceptable quality, which can become a bottleneck in practical design campaigns that often generate $10^4$ to $10^6$ designs per target. In image generation, Rectified Flows (ReFlow) can significantly reduce the required NFEs for a given target quality, but their application to protein backbone generation has been less studied. We apply ReFlow to improve the low-NFE performance of pretrained SE(3) flow matching models for protein backbone generation and systematically study ReFlow design choices for protein generation across data curation, training, and inference-time settings. In particular, we (1) show that ReFlow in the protein domain is particularly sensitive to the choice of coupling generation and annealing, (2) demonstrate that design choices that are useful for ReFlow in the image domain do not directly translate to better performance on proteins, and (3) make improvements to ReFlow methodology for proteins.
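
The ReFlow recipe referred to in the abstract can be illustrated with a minimal sketch. It is a one-dimensional Euclidean toy, not the SE(3) frame setting studied in the paper: a pretrained velocity field is integrated from noise to produce a coupling of (noise, sample) pairs, and a model is then regressed onto the straight-line velocity between each pair. The names `euler_sample`, `generate_coupling`, `reflow_loss`, and `teacher` are hypothetical and chosen only for illustration.

```python
import random


def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler steps.

    Each step costs one call to velocity, so n_steps equals the NFE.
    """
    x = x0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x


def generate_coupling(velocity, n_pairs, n_steps, seed=0):
    """Pair each noise draw x0 with the sample x1 that the given field maps it to."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        x0 = rng.gauss(0.0, 1.0)
        pairs.append((x0, euler_sample(velocity, x0, n_steps)))
    return pairs


def reflow_loss(model, pairs, seed=0):
    """Monte Carlo estimate of the ReFlow regression loss on a fixed coupling.

    At x_t = (1 - t) * x0 + t * x1 the regression target is the constant
    straight-line velocity x1 - x0.
    """
    rng = random.Random(seed)
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()
        x_t = (1.0 - t) * x0 + t * x1
        total += (model(x_t, t) - (x1 - x0)) ** 2
    return total / len(pairs)


# Hypothetical stand-in for a pretrained velocity field (1D toy, an assumption).
def teacher(x, t):
    return 3.0 - x


pairs = generate_coupling(teacher, n_pairs=256, n_steps=100)
# Loss of the un-rectified field when evaluated against straight-line targets.
print(reflow_loss(teacher, pairs))
```

In this toy, a model that drives `reflow_loss` to zero has constant velocity along each pair, so a single Euler step reproduces the many-step sample. On proteins, the interpolant and the loss would instead be defined on frames, which this sketch does not attempt.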

Paper Structure

This paper contains 31 sections, 1 theorem, 5 equations, 5 figures, 10 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $(X_0, X_1)$ be the coupling used to train the rectified flow and $(Z_0, Z_1)$ be the coupling induced by the rectified model. Then, under the assumptions of Theorem 2 in \cite{wu2025riemannianneuralgeodesicinterpolant}, we have where $d_g$ is the geodesic distance induced by some Riemannian metric on the manifold.

Figures (5)

  • Figure 1: Inference annealing settings affect secondary structure diversity through ReFlow coupling selection. Secondary structure distributions for QFlow show that models rectified on unannealed samples exhibit wider support and improved diversity (Table \ref{tab:dataimpactreqflow}), demonstrating how ReFlow coupling choice can shift the modeled distribution.
  • Figure 2: Secondary structure distribution for FoldFlow-OT under different inference annealing settings (top) and for models rectified on the corresponding couplings (bottom). The rectified models were sampled with an inference annealing of 10. The secondary structure statistics of the rectified model are strongly influenced by the fine-tuning coupling: greater diversity in the fine-tuning distribution translates to greater diversity in the fine-tuned model, although some bias towards helical structures from the base model (FoldFlow-OT) remains. This highlights the sensitivity of protein models to the fine-tuning distribution. It underlines the need to choose a ReFlow coupling that is representative of the desired protein distribution, and it points to potential opportunities in fine-tuning protein flow matching models on curated data.
  • Figure 3: FoldFlow backwards integration produces non-Gaussian latents, violating model assumptions. Distribution of $p$-values (log scale) from Kolmogorov-Smirnov tests shows that 96% of latents generated by backwards integration have $p < 0.005$, while synthetic centered Gaussian latents achieve high $p$-values (minimum 0.029 across 10,000 samples).
  • Figure 4: FoldFlow-OT exhibits complex dynamics near $t=0$ with significant trajectory curvature and rapid frame changes. Visualization of velocities and positions from 30 random Euclidean coordinates across 50 proteins of varying lengths (100-300 residues, $c=10$ inference annealing) shows substantial curvature in coordinate trajectories near the noise regime, while frame component velocity magnitudes decay significantly after $t\approx 0.3$.
  • Figure 5: We repeat the visualization of Fig. \ref{fig:FoldFlowOTDynamics} for our rectified model. The position coordinate trajectories of our rectified model show significantly less curvature, especially near $t=0$, and look almost indistinguishable from straight lines.
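
The Gaussianity check described in the Figure 3 caption can be sketched as a one-sample Kolmogorov-Smirnov test against a standard normal. This is a minimal illustration only. How the paper flattens or standardizes latents before testing is not stated here, so the sketch assumes each latent is already a flat list of coordinates with unit scale. The helper names are hypothetical, and the p-value uses the asymptotic Kolmogorov series with a common small-sample correction rather than any particular library routine.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(samples, cdf=normal_cdf):
    """Largest gap between the empirical CDF of samples and the reference cdf."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        # The tail probability is numerically 1 here and the series converges slowly.
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


# Synthetic stand-ins for latents (assumptions, not data from the paper).
rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(3.0, 1.0) for _ in range(1000)]
for name, xs in (("gaussian", gaussian), ("shifted", shifted)):
    d = ks_statistic(xs)
    print(name, d, ks_pvalue(d, len(xs)))
```

Applied per latent, a test of this kind yields one p-value per sample, and the distribution of those p-values is what the caption summarizes.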

Theorems & Definitions (2)

  • Proposition 3.1
  • proof