ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

Minghao Xu, Xinyu Yuan, Santiago Miret, Jian Tang

TL;DR

ProtST addresses a gap in protein language models (PLMs): they learn primarily from sequences and lack explicit information about functional properties. It introduces ProtDescribe, a large paired dataset of protein sequences and biomedical text descriptions, and a multimodal pre-training framework that combines unimodal mask prediction, multimodal representation alignment, and multimodal mask prediction to inject protein property information while preserving the PLM's original representation power. Across localization, fitness, and function annotation benchmarks, ProtST-induced PLMs outperform strong baselines. They also support zero-shot protein classification and text-to-protein retrieval, which shows that they generalize to unseen properties without labeled examples. The work offers a path toward more informative protein representations and enables text-guided protein discovery and retrieval, with potential extensions to protein structure and design.

Abstract

Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM's original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.
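
The multimodal representation alignment task mentioned above can be illustrated with a small sketch. The abstract does not state the loss, so this assumes a symmetric InfoNCE-style contrastive objective of the kind commonly used to align two modalities; the function names, the temperature value, and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(protein_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss (assumed form) over paired embeddings.

    protein_embs[i] and text_embs[i] are assumed to describe the same protein.
    The loss is low when each protein is most similar to its own text.
    """
    n = len(protein_embs)
    # Similarity of every protein to every text, scaled by the temperature.
    logits = [[cosine(p, t) / temperature for t in text_embs]
              for p in protein_embs]

    def diagonal_cross_entropy(rows):
        # Cross-entropy where the correct "class" for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    text_to_protein = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(text_to_protein))


# Toy 2-D embeddings standing in for PLM and BLM outputs (hypothetical).
proteins = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

print(alignment_loss(proteins, matched_texts))
print(alignment_loss(proteins, swapped_texts))
```

In this toy setting the matched pairing yields a much smaller loss than the swapped pairing, which is the behavior an alignment objective is meant to reward.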

Paper Structure (35 sections, 12 equations, 8 figures, 17 tables)


Figures (8)

  • Figure 1: Graphical illustration of the ProtST framework. (a) A protein language model (PLM) is first pre-trained along with a biomedical language model (BLM) and a fusion module to jointly model protein sequences and biomedical texts. (b) After this multi-modal pre-training, the PLM can be used on its own for supervised learning on downstream tasks. (c) The pre-trained PLM and BLM together can perform zero-shot protein classification using only label descriptions. (d) The paired PLM and BLM can also retrieve functional proteins from a large-scale database without any function annotation.
  • Figure 2: Zero-shot ProtST-ESM-1b outperforms few-shot classifiers. The horizontal line with a red star denotes the zero-shot performance of ProtST-ESM-1b. All few-shot results are averaged over seeds 0, 1, 2, 3 and 4, and gray intervals denote standard deviations.
  • Figure 3: Zero-shot ProtST-ESM-1b enhances few-shot classifiers' performance via ensemble. The horizontal line with a red star denotes the zero-shot performance of ProtST-ESM-1b. All few-shot results are averaged over seeds 0, 1, 2, 3 and 4, and gray intervals denote standard deviations.
  • Figure 4: Zero-shot text-to-protein retrieval of heme binders based on ProtST-ESM-1b.
  • Figure 5: Architecture of the fusion layer. This layer fuses the protein representation and the text representation by querying over them with self-attention and cross-attention.
  • ...and 3 more figures
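
The zero-shot uses described in Figure 1 (c) and (d) can be sketched as similarity lookups in a shared embedding space. This is a minimal illustration that assumes the PLM and BLM each produce a fixed-length vector and that cosine similarity is the scoring function. The helper names, label names, protein identifiers, and toy vectors are all hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(protein_emb, label_embs):
    """Pick the label whose description embedding is closest to the protein.

    label_embs maps a label name to the (assumed) BLM embedding of its
    text description; no labeled proteins are needed.
    """
    return max(label_embs, key=lambda name: cosine(protein_emb, label_embs[name]))


def retrieve(query_emb, database, top_k=2):
    """Rank database proteins by similarity to a text query embedding."""
    ranked = sorted(database,
                    key=lambda pid: cosine(query_emb, database[pid]),
                    reverse=True)
    return ranked[:top_k]


# Toy vectors standing in for model outputs (hypothetical values).
label_embs = {
    "nucleus": [1.0, 0.1, 0.0],
    "membrane": [0.0, 1.0, 0.1],
}
protein_emb = [0.9, 0.2, 0.0]
predicted = zero_shot_classify(protein_emb, label_embs)

database = {
    "protein_a": [0.1, 0.0, 1.0],
    "protein_b": [0.0, 0.2, 0.9],
    "protein_c": [1.0, 0.0, 0.0],
}
text_query_emb = [0.0, 0.1, 1.0]  # e.g. an embedded functional description
hits = retrieve(text_query_emb, database, top_k=2)

print(predicted, hits)
```

Classification and retrieval differ only in direction: classification scores one protein against several text descriptions, while retrieval scores one text query against many proteins.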