A Text-guided Protein Design Framework

Shengchao Liu, Yanjing Li, Zhuoxinran Li, Anthony Gitter, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Arvind Ramanathan, Chaowei Xiao, Jian Tang, Hongyu Guo, Anima Anandkumar

TL;DR

ProteinDT is a multi-modal framework that pairs textual protein descriptions with sequence data to enable text-guided protein generation and editing. It chains three components: ProteinCLAP (text–protein contrastive pretraining), ProteinFacilitator (a mapping from text representations to protein representations, modeled with a Gaussian), and a decoder (autoregressive or diffusion-based) that produces protein sequences conditioned on the resulting representation. The authors curate SwissProtCLAP (441K text–protein pairs) to train ProteinDT and report over 90% accuracy on text-guided protein generation, the best hit ratio on 12 zero-shot text-guided protein editing tasks, and superior performance on four of six protein property prediction benchmarks. The work suggests that textual knowledge can guide protein design, including zero-shot editing without task-specific labeled data.

Abstract

Current AI-assisted protein design mainly utilizes protein sequence and structure information. Meanwhile, there exists tremendous knowledge curated by humans in text format describing proteins' high-level functionalities. Yet, whether incorporating such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator that generates the protein representation from the text modality; and a decoder that creates protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
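
To make the alignment step concrete, here is a minimal sketch of a contrastive objective over paired text and protein embeddings. It assumes a symmetric InfoNCE-style loss with a temperature parameter; the function names, the temperature value, and the toy vectors are illustrative assumptions, not the authors' implementation, and the exact loss used by ProteinCLAP may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(text_embs, protein_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form) over a batch of pairs.

    text_embs[i] and protein_embs[i] describe the same protein; every
    other pairing in the batch is treated as a negative.
    """
    t = [l2_normalize(v) for v in text_embs]
    p = [l2_normalize(v) for v in protein_embs]
    n = len(t)
    logits = [[dot(t[i], p[j]) / temperature for j in range(n)] for i in range(n)]
    transposed = [list(col) for col in zip(*logits)]
    # Average the text-to-protein and protein-to-text directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy batch: aligned pairs should give a much lower loss than swapped pairs.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(texts, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(texts, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this shape pulls each text embedding toward its paired protein embedding and pushes it away from the other proteins in the batch, which is the sense in which the two representation spaces become aligned.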

Paper Structure (33 sections, 13 equations, 10 figures, 24 tables)

Figures (10)

  • Figure 1: Pipeline for ProteinDT pretraining framework (a-c) and downstream tasks (d-f). (a) ProteinCLAP, a contrastive learning paradigm, aligns the representation space of the text and protein sequence modalities. (b) ProteinFacilitator model augments the mapping from text sequence representation to protein sequence representation. (c) A protein sequence decoder, which generates protein sequences conditioned on the representations produced from previous steps. (d) Downstream text-to-protein generation task. (e) Downstream text-guided protein editing task. (f) Downstream protein property prediction task.
  • Figure 2: Visualization of text-to-protein generation and text-guided protein editing. (a) Visualization for evaluation of text-to-protein generation. The pretrained ProteinCLAP is used to calculate the similarity between the sampled text and generated protein sequence pairs. (b-c) Two methods for text-guided protein editing: latent interpolation and latent optimization. (d-e) Visualization for evaluation of text-guided protein editing. For four types of editing tasks, different evaluation metrics (marked in red) are applied accordingly.
  • Figure 3: Visual analysis of text-guided protein editing with latent optimization. (a-c) Structure editing toward more or fewer $\alpha$-helices. (d-f) Structure editing toward more or fewer $\beta$-sheets. (g-i) Peptide-binding editing on PDB 3IQI.
  • Figure 4: Inference illustration for the two decoder models (step 3 in ProteinDT). The first is an autoregressive (AR) model based on the Transformer, which generates the protein sequence token by token. The second is ProteinDiff, shown with a Transformer-based transition network; it first randomly samples a noised sequence, then runs a denoising process using the transition network.
  • Figure 5: Illustration of retrieval accuracy in text-to-protein generation.
  • ...and 5 more figures
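
The retrieval-style evaluation mentioned in the captions of Figure 2(a) and Figure 5 can be sketched as follows. This is a rough illustration under stated assumptions: each generated protein is scored by cosine similarity against its own text prompt plus a few distractor prompts, and counts as correct when its own prompt scores highest. The function names, the cyclic choice of distractors (used here only to keep the example deterministic), and the toy embeddings are hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def retrieval_accuracy(text_embs, protein_embs, num_choices):
    """Fraction of generated proteins that retrieve their own text prompt.

    protein_embs[i] is the embedding of the protein generated from text i.
    For each protein, the candidate set is its own text plus the next
    num_choices - 1 texts in cyclic order (a deterministic stand-in for
    sampled negatives). Exact ties resolve in favor of the protein's own
    text, because it is listed first among the candidates.
    """
    n = len(protein_embs)
    if not 1 <= num_choices <= n:
        raise ValueError("num_choices must be between 1 and the number of pairs")
    hits = 0
    for i in range(n):
        candidates = [(i + k) % n for k in range(num_choices)]
        best = max(candidates, key=lambda j: cosine(text_embs[j], protein_embs[i]))
        hits += best == i
    return hits / n


# Toy embeddings: the third generated protein resembles the second text,
# so it is retrieved incorrectly when all three texts are candidates.
toy_texts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_proteins = [[0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
accuracy = retrieval_accuracy(toy_texts, toy_proteins, num_choices=3)
```

In practice the embeddings would come from a pretrained text–protein encoder pair such as ProteinCLAP, and a larger candidate set makes the task harder, so reported accuracy should always be read together with the number of choices.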