Prot2Chat: Protein LLM with Early-Fusion of Text, Sequence and Structure

Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, Yiqin Gao

TL;DR

Prot2Chat tackles the challenge of integrating multimodal protein data for Q&A by fusing sequence, structure, and text within a single large language model. It extends ProteinMPNN to create a unified protein encoder and introduces a text-aware adapter that compresses multimodal protein information into a soft prompt aligned with the input question, enabling early fusion with the LLM. The model is lightweight (approximately 109 million trainable parameters) due to freezing the encoder and using LoRA for the LLM, and it demonstrates superior performance on Mol-Instructions and UniProtQA with zero-shot generalization, outperforming several baselines and showing strong expert agreement. This work highlights the value of early multimodal fusion for accurate, context-aware protein reasoning and provides a practical framework for efficient, high-quality protein Q&A using LLMs.

Abstract

Motivation: Proteins are of great significance in living organisms. However, efforts to understand their functions face several challenges: insufficient integration of multimodal information, a large number of trainable parameters, the limited flexibility of classification-based methods, and the lack of systematic evaluation metrics for protein Q&A systems. To tackle these issues, we propose the Prot2Chat framework.

Results: We modified ProteinMPNN to encode protein sequence and structural information in a unified way. We used a large language model (LLM) to encode questions into vectors and developed a protein-text adapter that compresses protein information into virtual tokens conditioned on these vectors, achieving early fusion of text and protein information. Finally, the same LLM reads the virtual tokens and the question to generate an answer. To improve training efficiency, we froze the encoder and applied Low-Rank Adaptation (LoRA) to the LLM. Experiments on two datasets show that our model performs best under both automated metrics and expert evaluation, and zero-shot prediction results highlight its generalization ability.

Availability: The models and code are available at https://github.com/wangzc1233/Prot2Chat.

Contact: zqcao@suda.edu.cn or wangzc025@163.com

Keywords: Protein Q&A, Early-Fusion, LLM

Paper Structure

This paper contains 18 sections, 8 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Prot2Chat can assist humans in understanding protein information and enables cross-modal communication. Here, '<Soft Prompt>' is the prompt our model obtains by fusing protein structure, sequence, and text information; it helps the LLM generate more valuable answers.
  • Figure 2: Model structure of Prot2Chat. Red font marks the inputs, a snowflake marks frozen parameters, and a flame marks trainable parameters. An encoder produces a protein embedding that fuses features from the protein's structure and sequence. In parallel, a question vector is obtained from the question text. This vector is then fused and aligned with the protein information at an early stage to obtain the soft prompt. Finally, the soft prompt and the question text are fed into the LLM to produce the answer.
  • Figure 3: Detail of the Text-Aware Protein-Text Adapter. The adapter takes the protein embedding and the question vector as inputs. It integrates text information at an early stage through a set of learnable queries. These queries then attend to the protein embedding, which serves as the keys and values in a cross-attention computation. The cross-attention module compresses the protein information to a fixed length and, because the queries are infused with question text information, captures the protein information most relevant to the question. The result is a soft prompt that helps the LLM generate more accurate responses.
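
The adapter described in Figure 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `TextAwareAdapter`, all dimensions, the random (untrained) weights, the single attention head, and in particular the choice to inject the question by adding its projection to each learnable query are assumptions made for illustration. The captions only state that text information enters through the learnable queries and that the protein embedding provides the keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TextAwareAdapter:
    """Hypothetical single-head cross-attention adapter (illustration only)."""

    def __init__(self, d_protein, d_text, d_model, num_queries, seed=0):
        rng = np.random.default_rng(seed)

        def init(d_in, d_out):
            # Scaled Gaussian init; a trained model would learn these weights.
            return rng.normal(0.0, d_in ** -0.5, size=(d_in, d_out))

        self.d_model = d_model
        self.queries = rng.normal(size=(num_queries, d_model))  # learnable queries
        self.w_text = init(d_text, d_model)   # projects the question vector
        self.w_q = init(d_model, d_model)
        self.w_k = init(d_protein, d_model)
        self.w_v = init(d_protein, d_model)

    def __call__(self, protein_emb, question_vec):
        # Early fusion (assumed form): add the projected question vector
        # to every learnable query before attention.
        fused = self.queries + question_vec @ self.w_text       # (Q, d_model)
        q = fused @ self.w_q
        # The protein embedding supplies keys and values, one row per residue.
        k = protein_emb @ self.w_k                               # (L, d_model)
        v = protein_emb @ self.w_v                               # (L, d_model)
        attn = softmax(q @ k.T / np.sqrt(self.d_model), axis=-1)  # (Q, L)
        # Each query pools over all residues, so the output length is Q,
        # independent of the protein length L.
        return attn @ v, attn


adapter = TextAwareAdapter(d_protein=24, d_text=16, d_model=32, num_queries=8)
rng = np.random.default_rng(1)
question = rng.normal(size=16)
short_protein = rng.normal(size=(50, 24))    # 50 residues
long_protein = rng.normal(size=(300, 24))    # 300 residues

soft_short, attn_short = adapter(short_protein, question)
soft_long, _ = adapter(long_protein, question)
print(soft_short.shape, soft_long.shape)     # both (8, 32)
```

Two properties of the sketch mirror the captions: the soft prompt has a fixed number of rows regardless of protein length, and changing the question vector changes which residues the queries attend to, and therefore the soft prompt itself.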