ProtChatGPT: Towards Understanding Proteins with Large Language Models
Chao Wang, Hehe Fan, Ruijie Quan, Yi Yang
TL;DR
ProtChatGPT addresses the challenge of making protein knowledge accessible via natural language by bridging multi-level protein representations with large language models. The authors propose a progressive Protein-Language Pretraining (PLP) framework consisting of multi-level encoding, PLP-former alignment, and an instruction-tuning stage with a projection adapter, enabling protein-to-text generation using frozen LLMs. They demonstrate that ProtChatGPT improves protein understanding and design tasks through qualitative conversations, Protein Q&A on PDB-QA, and cross-modal protein-text retrieval on ProteinKG25, with ablation studies confirming the importance of each component. The work promises to democratize protein analysis by enabling interactive, text-based exploration and could seed future protein research tools.
Abstract
Protein research is crucial in various fundamental disciplines, but understanding the intricate structure-function relationships of proteins remains challenging. Recent Large Language Models (LLMs) have made significant strides in comprehending task-specific knowledge, suggesting the potential for ChatGPT-like systems specialized in proteins to facilitate basic research. In this work, we introduce ProtChatGPT, which aims at learning and understanding protein structures via natural language. ProtChatGPT enables users to upload proteins, ask questions, and engage in interactive conversations that yield comprehensive answers. The system comprises protein encoders, a Protein-Language Pretraining Transformer (PLP-former), a projection adapter, and an LLM. The protein is first passed through the protein encoders and the PLP-former to produce protein embeddings, which the adapter then projects to conform with the LLM. Finally, the LLM combines the user's question with the projected embeddings to generate informative answers. Experiments show that ProtChatGPT can produce promising responses to proteins and their corresponding questions. We hope that ProtChatGPT can form the basis for further exploration and application in protein research. Code and our pre-trained model will be publicly available.
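
The data flow described in the abstract (protein encoders, PLP-former, projection adapter, LLM) can be illustrated with a minimal shape-level sketch. This is not the authors' implementation: every function name, dimension, and weight below is hypothetical, the encoder is replaced by random features, the PLP-former is assumed here to be a single cross-attention step over a small set of query vectors, and the adapter is assumed to be a single linear layer. The sketch only shows how protein embeddings and question embeddings might end up in one sequence that an LLM could consume.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to make the shapes visible.
D_PROT, N_QUERY, D_LLM, N_QUESTION_TOKENS = 64, 8, 128, 5


def encode_protein(sequence: str) -> np.ndarray:
    """Stand-in for the protein encoders: one random feature vector per residue."""
    return rng.standard_normal((len(sequence), D_PROT))


def plp_former(residue_feats: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Assumed form of the PLP-former: queries cross-attend over residue features."""
    scores = queries @ residue_feats.T / np.sqrt(D_PROT)  # (N_QUERY, seq_len)
    scores -= scores.max(axis=1, keepdims=True)           # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ residue_feats                        # (N_QUERY, D_PROT)


def project(protein_emb: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Assumed linear adapter mapping protein embeddings to the LLM width."""
    return protein_emb @ W + b                            # (N_QUERY, D_LLM)


# Hypothetical parameters; in a real system these would be learned.
queries = rng.standard_normal((N_QUERY, D_PROT))
W = rng.standard_normal((D_PROT, D_LLM)) * 0.02
b = np.zeros(D_LLM)

sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up amino-acid string for illustration
residue_feats = encode_protein(sequence)
protein_emb = plp_former(residue_feats, queries)
protein_tokens = project(protein_emb, W, b)

# Stand-in for the embedded user question (already in the LLM's embedding space).
question_tokens = rng.standard_normal((N_QUESTION_TOKENS, D_LLM))

# The LLM would receive the projected protein embeddings followed by the question.
llm_input = np.concatenate([protein_tokens, question_tokens], axis=0)
print(llm_input.shape)
```

Under these assumptions, the protein contributes a fixed number of embedding vectors regardless of sequence length, and only the adapter output has to match the LLM's embedding width.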
