CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao

TL;DR

CFP-Gen presents a diffusion-based, multimodal protein design framework that jointly enforces function, sequence motifs, and structure through Annotation-Guided Feature Modulation (AGFM), Residue-Controlled Functional Encoding (RCFE), and an off-the-shelf structure encoder. By training on GO/IPR/EC annotations and backbone coordinates, CFP-Gen achieves superior functional fidelity, inverse folding performance, and multi-objective design efficiency compared with prior controllable PLMs. The approach alleviates mode collapse, preserves structural coherence, and demonstrates strong novelty and diversity in generated sequences, enabling practical design of multifunctional enzymes and functional proteins. The results suggest a scalable path toward more comprehensive condition sets and end-to-end co-design with structural constraints for real-world biotechnological applications.

Abstract

Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains, and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interactions to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data are available at https://github.com/yinjunbo/cfpgen.
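
To make the idea of adjusting features from composable annotations more concrete, the following is a minimal NumPy sketch of one common conditioning pattern: scale-and-shift modulation driven by a summed annotation embedding. It is an illustration only, not the authors' AGFM implementation. The hidden size, the random weights, the function name `modulate`, and the annotation identifiers used as dictionary keys are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # hypothetical hidden size, chosen only for this example

# Hypothetical annotation vocabulary; the identifiers are placeholders
# standing in for GO terms, IPR domains, and EC numbers.
ANNOTATIONS = ["GO:0003824", "IPR000719", "EC:2.7.11.1"]
embed = {name: rng.normal(size=HIDDEN) for name in ANNOTATIONS}

# Random projections standing in for learned weights (assumption).
W_scale = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W_shift = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))


def modulate(features, annotations):
    """Apply scale-and-shift modulation to per-residue features.

    features: array of shape (length, HIDDEN)
    annotations: list of keys from `embed`; any subset may be combined
    """
    if not annotations:
        # No condition given: leave the features unchanged.
        return features
    # Compose annotations by summing their embeddings.
    cond = np.sum([embed[a] for a in annotations], axis=0)
    scale = cond @ W_scale  # shape (HIDDEN,)
    shift = cond @ W_shift  # shape (HIDDEN,)
    # Broadcast the same scale and shift over every residue position.
    return features * (1.0 + scale) + shift


x = rng.normal(size=(5, HIDDEN))  # stand-in for a noised sequence embedding
y_none = modulate(x, [])
y_one = modulate(x, ["GO:0003824"])
y_two = modulate(x, ["GO:0003824", "EC:2.7.11.1"])
```

Summing the embeddings is one simple way to let any subset of annotations be supplied at generation time; the actual composition rule used by CFP-Gen may differ.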

Paper Structure

This paper contains 31 sections, 8 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Motivation of CFP-Gen. (a) Previous PLMs typically generate proteins based on single-modality conditioning, considering only individual functional constraints. (b) In contrast, CFP-Gen incorporates multiple conditions from diverse modalities—function, sequence and structure—to impose comprehensive functional constraints, thereby leading to optimized proteins.
  • Figure 2: Pipeline of CFP-Gen model. Functional conditions from diverse modalities, combined with the noised sequence, are iteratively processed by the model to generate desired proteins. Within each modified ESM block, AGFM adaptively adjusts the noised sequence embedding based on combinations of various functional annotations. Furthermore, sequence motifs and backbone atomic coordinates are embedded by RCFE and a structure encoder, respectively, providing precise and flexible guidance for the generation process.
  • Figure 3: Examples of multi-catalytic enzymes. CFP-Gen generates high-quality proteins (i.e., TM-score above 90) with multimodal conditions. The ground-truth structures from the AFDB database are in green, while the generated structures are in red.
  • Figure 4: Evaluation of multi-catalytic enzyme design. Our generated proteins exhibit high designability while also achieving a high success rate and functionality, as validated by CLEAN.
  • Figure 5: Comparison of sequence novelty and diversity between real proteins and our designed proteins across 7 typical EC numbers from different enzyme families.
  • ...and 4 more figures