ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings

Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu

TL;DR

ProtFlow tackles the high-cost, slow-generation problem in protein design by applying flow matching to a compressed, semantically meaningful latent space derived from protein language models. By redesigning the latent space and employing Rectified Flow with the Reflow technique, it achieves fast, near one-shot generation while maintaining high sequence quality and structural plausibility. The framework is demonstrated across general peptides, long-chain proteins, antimicrobial peptides, and antibodies, where it outperforms task-specific baselines in distributional alignment and structural metrics. This approach offers a practical, scalable path for rapid, multi-chain protein design in diverse biomedical applications.

Abstract

The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residue-level semantics and suffer from low inference efficiency, a large modeling space, and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from the semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.
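
For readers unfamiliar with the terms, the following is a hedged sketch of the standard Rectified Flow formulation that "flow matching" and "reflow" usually refer to. The notation is an assumption made for illustration: $\epsilon$ is Gaussian noise, $h_c$ is a compressed embedding, $v_\theta$ is the learned vector field, and noise is placed at $t=0$ and data at $t=1$. The paper's own convention and loss weighting may differ.

$$
h_t = (1-t)\,\epsilon + t\,h_c, \qquad t \in [0,1], \quad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\,\epsilon,\,h_c}\left\| v_\theta(h_t, t) - (h_c - \epsilon) \right\|^2
$$

Under these assumptions, reflow retrains the same objective on pairs $(\epsilon, h'_c)$, where $h'_c$ is obtained by integrating the learned ODE starting from $\epsilon$. This tends to straighten the trajectories, so that a single Euler step, $h'_c \approx \epsilon + v_\theta(\epsilon, 0)$, becomes a reasonable approximation of the full solve.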

Paper Structure

This paper contains 53 sections, 20 equations, 6 figures, 4 tables, 3 algorithms.

Figures (6)

  • Figure 1: Overview of ProtFlow for protein sequence design. (A) Left: visualization of the mathematical workflow of ProtFlow, including the training and inference phases. Right: relationships among different protein groups; (B) architecture schematic of each component of ProtFlow. The FM holder is trained with the other components frozen. In the training phase, the sampled sequence $x$ is mapped to the latent space as $h_c$ by the pLM encoder and compressor; the FM holder learns the FM vector field $v_t$. In the inference phase, the ODE solver starts from sampled random noise $\epsilon$ and a starting time point $t$, iteratively solves for $h'_c$ using the $v_t$ represented by the FM holder, and maps the result back to $x'$ through the decompressor and decoder.
  • Figure 2: Joint Design of Antibodies. Heavy chains and light chains are used to fine-tune separate ESM-2 decoders. The embeddings are concatenated before being fed into ProtFlow. $D_{hidden}$ is the ESM-2 hidden dimension; $L_{heavy}$ and $L_{light}$ are the maximum lengths of the heavy and light chains.
  • Figure 3: Visualized examples of general peptide design.
  • Figure 4: Visualized examples of general long-chain protein design.
  • Figure 5: Visualized examples of AMP design.
  • ...and 1 more figure
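
The inference phase described in the Figure 1 caption (start from noise, repeatedly apply the vector field, obtain a latent) can be illustrated with a minimal fixed-step Euler solver. This is only a sketch under stated assumptions, not the authors' implementation: `euler_sample`, `toy_velocity`, and `target` are hypothetical names, the time axis is assumed to run from noise at $t=0$ to data at $t=1$, and the trained FM holder is replaced by a closed-form field whose trajectories are perfectly straight lines toward a single target vector. The pLM encoder, compressor, decompressor, and decoder are omitted.

```python
import numpy as np

def euler_sample(velocity, noise, num_steps):
    """Integrate dh/dt = velocity(h, t) from t=0 to t=1 with fixed-step Euler."""
    h = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # current time in [0, 1)
        h = h + dt * velocity(h, t)     # one Euler update
    return h

# Hypothetical stand-in for a latent the model should generate.
target = np.array([1.0, -2.0, 0.5])

def toy_velocity(h, t):
    # Assumed toy field (NOT a trained network): for straight paths from any
    # start point to `target`, the velocity at (h, t) is (target - h) / (1 - t).
    return (target - h) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(target.shape)   # plays the role of epsilon

one_step = euler_sample(toy_velocity, noise, num_steps=1)
ten_step = euler_sample(toy_velocity, noise, num_steps=10)

# With perfectly straight trajectories, 1 step and 10 steps agree.
print(np.allclose(one_step, ten_step))
```

In this toy setting a single step lands on the same point as ten steps because the trajectories are exactly straight. A learned vector field is only approximately straight, which is the gap that reflow is intended to narrow.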