Prot2Chat: Protein LLM with Early-Fusion of Text, Sequence and Structure
Zhicong Wang, Zicheng Ma, Ziqiang Cao, Changlong Zhou, Jun Zhang, Yiqin Gao
TL;DR
Prot2Chat tackles the challenge of integrating multimodal protein data for Q&A by fusing sequence, structure, and text within a single large language model. It extends ProteinMPNN to create a unified protein encoder and introduces a text-aware adapter that compresses multimodal protein information into a soft prompt aligned with the input question, enabling early fusion with the LLM. The model is lightweight (approximately 109 million trainable parameters) due to freezing the encoder and using LoRA for the LLM, and it demonstrates superior performance on Mol-Instructions and UniProtQA with zero-shot generalization, outperforming several baselines and showing strong expert agreement. This work highlights the value of early multimodal fusion for accurate, context-aware protein reasoning and provides a practical framework for efficient, high-quality protein Q&A using LLMs.
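The parameter savings from LoRA can be illustrated with simple arithmetic. The sketch below is not taken from the paper: the layer size (4096 by 4096) and the rank (8) are illustrative assumptions, and `full_params` and `lora_params` are hypothetical helper names. It only shows why training two low-rank factors in place of a full weight update keeps the trainable parameter count small.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights if the whole d_out x d_in matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA update B @ A, with
    A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Illustrative sizes only (assumptions, not values from the paper).
d_in, d_out, rank = 4096, 4096, 8
full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
ratio = lora / full
print(f"full: {full}, LoRA: {lora}, ratio: {ratio:.4%}")
```

Under these assumed sizes, the low-rank update trains well under one percent of the weights of the layer it adapts. The paper's total trainable count also includes the adapter itself.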
Abstract
Motivation: Proteins are of great significance in living organisms. However, understanding their functions faces several challenges: insufficient integration of multimodal information, a large number of training parameters, the limited flexibility of classification-based methods, and the lack of systematic evaluation metrics for protein Q&A systems. To tackle these issues, we propose the Prot2Chat framework.
Results: We modified ProteinMPNN to encode protein sequence and structural information in a unified way. We used a large language model (LLM) to encode questions into vectors and developed a protein-text adapter that compresses protein information into virtual tokens based on these vectors, achieving early fusion of text and protein information. Finally, the same LLM reads the virtual tokens and the questions to generate answers. To optimize training efficiency, we froze the encoder and employed Low-Rank Adaptation (LoRA) for the LLM. In experiments on two datasets, both automated metrics and expert evaluations demonstrate the superior performance of our model, and zero-shot prediction results highlight its generalization ability.
The models and code are available at https://github.com/wangzc1233/Prot2Chat.
Contact: zqcao@suda.edu.cn or wangzc025@163.com
Key words: Protein Q&A, Early-Fusion, LLM
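The idea of compressing protein information into a fixed number of virtual tokens conditioned on the question can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' adapter: it assumes a single-head cross-attention in which a small set of learnable query vectors is shifted by the question vector and then attends over per-residue embeddings. The function `compress_protein` and all dimensions are hypothetical, and learned projections are omitted for brevity.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_protein(residue_embs, question_vec, learned_queries):
    """Return one virtual token per learned query.

    Each query is shifted by the question vector (the assumed form of
    text conditioning), then used to attend over all residue embeddings.
    """
    dim = len(question_vec)
    virtual_tokens = []
    for q in learned_queries:
        query = [qi + ti for qi, ti in zip(q, question_vec)]
        scores = [dot(query, r) / math.sqrt(dim) for r in residue_embs]
        weights = softmax(scores)
        # Attention-weighted average of the residue embeddings.
        token = [sum(w * r[d] for w, r in zip(weights, residue_embs))
                 for d in range(dim)]
        virtual_tokens.append(token)
    return virtual_tokens


# Toy inputs: 50 residues, 4 virtual tokens, embedding size 8 (all assumed).
random.seed(0)
dim, n_residues, n_tokens = 8, 50, 4
residues = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_residues)]
question = [random.gauss(0, 1) for _ in range(dim)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

tokens = compress_protein(residues, question, queries)
print(len(tokens), len(tokens[0]))
```

Whatever the protein length, the output has a fixed number of tokens, so it can be prepended to the question as a soft prompt. Because the queries depend on the question, a different question yields a different compression of the same protein.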
