ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua

TL;DR

ProtT3 tackles the gap in protein understanding by coupling a protein language model (PLM) with a language model (LM) through a cross-modal projector to enable protein-to-text generation. The method unfolds in two stages: retrieval-oriented pretraining to align protein and text representations, followed by generation-focused training that conditions text output on protein inputs. Key contributions include a formal benchmark suite for protein captioning, retrieval, and QA, state-of-the-art gains on Swiss-Prot, ProteinKG25, and PDB-QA, and an efficient, scalable training setup using LoRA adapters. This work paves the way for text-based protein understanding and downstream biomedical applications, with planned extensions to 3D structure integration and broader biological reasoning.

Abstract

Language Models (LMs) excel in understanding textual descriptions of proteins, as evident in biomedical question-answering tasks. However, their capability falters with raw protein data, such as amino acid sequences, due to a deficit in pretraining on such data. Conversely, Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts. To address their limitations, we introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding. ProtT3 empowers an LM to understand protein sequences of amino acids by incorporating a PLM as its protein understanding module, enabling effective protein-to-text generation. This collaboration between PLM and LM is facilitated by a cross-modal projector (i.e., Q-Former) that bridges the modality gap between the PLM's representation space and the LM's input space. Unlike previous studies focusing on protein property prediction and protein-text retrieval, we delve into the largely unexplored field of protein-to-text generation. To facilitate comprehensive benchmarks and promote future research, we establish quantitative evaluations for protein-text modeling tasks, including protein captioning, protein question-answering, and protein-text retrieval. Our experiments show that ProtT3 substantially surpasses current baselines, with ablation studies further highlighting the efficacy of its core components. Our code is available at https://github.com/acharkq/ProtT3.
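
The role of the cross-modal projector described above can be illustrated with a small sketch. It is not the authors' implementation: a real Q-Former is a multi-layer transformer, whereas this is a single-head, single-layer simplification in pure Python with random weights. All names here (`project_protein`, `queries`, `w_k`, `w_v`, `w_out`) and all dimensions are hypothetical. The point it shows is that a set of learnable queries cross-attends to variable-length PLM outputs and yields a fixed number of vectors in the LM's input width, which the LM could then consume as soft prompts.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def project_protein(plm_states, queries, w_k, w_v, w_out):
    """Cross-attend learnable queries to PLM states, then map to the LM width.

    plm_states: [residues][plm_dim]    per-residue PLM outputs
    queries:    [n_queries][attn_dim]  learnable query vectors (hypothetical)
    returns:    [n_queries][lm_dim]    soft prompts for the LM
    """
    keys = matmul(plm_states, w_k)                    # [residues][attn_dim]
    values = matmul(plm_states, w_v)                  # [residues][attn_dim]
    keys_t = [list(col) for col in zip(*keys)]        # [attn_dim][residues]
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, keys_t)                  # [n_queries][residues]
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, values)                # [n_queries][attn_dim]
    return matmul(attended, w_out)                    # [n_queries][lm_dim]

# Illustrative sizes only; none of these come from the paper.
rng = random.Random(0)
residues, plm_dim, attn_dim, lm_dim, n_queries = 12, 16, 8, 32, 4
plm_states = random_matrix(residues, plm_dim, rng)    # stand-in for PLM output
queries = random_matrix(n_queries, attn_dim, rng)
w_k = random_matrix(plm_dim, attn_dim, rng)
w_v = random_matrix(plm_dim, attn_dim, rng)
w_out = random_matrix(attn_dim, lm_dim, rng)

soft_prompts = project_protein(plm_states, queries, w_k, w_v, w_out)
print(len(soft_prompts), len(soft_prompts[0]))
```

Because the number of queries is fixed, the output shape does not depend on the protein length, so proteins of different sizes occupy the same number of positions in the LM's input.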

Paper Structure

This paper contains 23 sections, 4 equations, 5 figures, and 14 tables.

Figures (5)

  • Figure 1: Examples of protein-to-text generation tasks. Proteins are represented by sequences of amino acids.
  • Figure 2: Overview of the ProtT3 framework.
  • Figure 3: Training stage 1 of ProtT3. (a): Cross-Modal Projector: Q-Former's architecture and the three training tasks. (b): The self-attention module uses different masking strategies for different tasks.
  • Figure 4: Protein captioning examples from Swiss-Prot. We highlight sentences that exactly match the ground truth. Figures of protein structures are generated by AlphaFold2.
  • Figure 5: Examples of protein QA results in the PDB-QA dataset. We highlight the correct predictions.