ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, Wei Wang

TL;DR

ProteinGPT addresses the challenge of holistic protein understanding by fusing sequence and structure information into a multimodal language system. It adopts a two-stage training pipeline: modality alignment with frozen encoders and a projection layer, followed by instruction tuning on a ProteinQA-derived QA corpus built from RCSB-PDB. Across multiple LLM backbones, ProteinGPT, especially the Mistral variant, scores higher on semantic and lexical metrics for protein-focused questions than vanilla LLMs and general-purpose models. The work provides open-source code and the ProteinQA dataset, enabling researchers to extend modality fusion for protein design and discovery, with future directions including retrieval grounding and lab-work integration.

Abstract

Understanding biological processes, drug development, and biotechnological advancements requires a detailed analysis of protein structures and functions, a task that is inherently complex and time-consuming in traditional protein research. To streamline this process, we introduce ProteinGPT, a state-of-the-art multimodal large language model for proteins that enables users to upload protein sequences and/or structures for comprehensive analysis and responsive inquiries. ProteinGPT integrates protein sequence and structure encoders with linear projection layers to ensure precise representation adaptation and leverages a large language model (LLM) to generate accurate, contextually relevant responses. To train ProteinGPT, we constructed a large-scale dataset of 132,092 proteins, each annotated with 20-30 property tags and 5-10 QA pairs per protein, and optimized the instruction-tuning process using GPT-4o. Experiments demonstrate that ProteinGPT effectively generates informative responses to protein-related questions, achieving high performance on both semantic and lexical metrics and significantly outperforming baseline models and general-purpose LLMs in understanding and responding to protein-related queries. Our code and data are available at https://github.com/ProteinGPT/ProteinGPT.
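
The fusion step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the dimensions, the random weights, the class and function names, and the choice to concatenate the two projected token streams (structure first) are all assumptions made for illustration, and a real system would use a tensor library and pretrained encoders rather than plain Python lists.

```python
import random
from typing import List

Vector = List[float]


class LinearProjection:
    """y = W x + b, applied independently to each encoder token (hypothetical helper)."""

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Small random weights stand in for parameters that would be learned
        # during the alignment stage.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, tokens: List[Vector]) -> List[Vector]:
        return [
            [sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(self.weight, self.bias)]
            for tok in tokens
        ]


def fuse_protein_tokens(seq_tokens: List[Vector],
                        struct_tokens: List[Vector],
                        seq_proj: LinearProjection,
                        struct_proj: LinearProjection) -> List[Vector]:
    """Map both modalities into the LLM embedding space and join them.

    Concatenation along the token axis (structure first) is an assumption;
    the text above does not say how the two streams are combined.
    """
    return struct_proj(struct_tokens) + seq_proj(seq_tokens)


# Toy sizes chosen for readability, not taken from the paper.
SEQ_DIM, STRUCT_DIM, LLM_DIM = 6, 4, 8
NUM_RESIDUES = 5

rng = random.Random(1)
# Stand-ins for the outputs of frozen sequence and structure encoders.
seq_tokens = [[rng.random() for _ in range(SEQ_DIM)] for _ in range(NUM_RESIDUES)]
struct_tokens = [[rng.random() for _ in range(STRUCT_DIM)] for _ in range(NUM_RESIDUES)]

seq_proj = LinearProjection(SEQ_DIM, LLM_DIM, seed=2)
struct_proj = LinearProjection(STRUCT_DIM, LLM_DIM, seed=3)

protein_prefix = fuse_protein_tokens(seq_tokens, struct_tokens, seq_proj, struct_proj)
print(len(protein_prefix), len(protein_prefix[0]))  # 10 8
```

Each modality gets its own projection because the two encoders may emit vectors of different widths; after projection every token has the LLM's embedding width and can be placed in front of the text tokens.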

Paper Structure

This paper contains 29 sections, 1 equation, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: ProteinGPT Modality Fusion & Alignment Stage: we freeze the encoder blocks and train the linear projection layer to learn how to align protein structure and protein sequence representations with text. In the alignment stage, the training input is only the projected protein representation. No text prompts are incorporated in this stage.
  • Figure 2: ProteinGPT Instruction Tuning Stage: we utilize the QA pairs and property tags in ProteinQA to tune the LLM to follow instructions and give concise responses. For instruction alignment, explicit prompts (questions about the protein) are included at the beginning of the prompt.
  • Figure 3: Protein Text LLM takes the protein's primary sequence as part of the prompt to the model. GPT models are more powerful than open-source LLMs like LLaMA and Mistral. Given the same protein sequence as input, ProteinGPT utilizes the information from sequence and structure encoders and yields more accurate responses.
  • Figure 4: Performance improves progressively from the vanilla LLM with the protein given as text, to the modality-aligned version, and finally to the instruction-tuned variants of ProteinGPT. Each stage of ProteinGPT's training results in substantial enhancements in both lexical and semantic performance, showcasing the effectiveness of our framework.
  • Figure 5: Conversations between humans and ProteinGPT on Protein 6O7Q, where ProteinGPT provides detailed insights into both sequence (e.g., 60-subunit MoFe proteins) and structural information (e.g., substrate azide and product ammonia).
  • ...and 5 more figures
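
The difference between the inputs of the two training stages, as described in the captions of Figures 1 and 2, can be sketched as follows. This is an illustrative sketch only: the `<protein>` placeholder, the `ProteinRecord` fields, the newline separator, and the example record are hypothetical and do not reflect the actual ProteinQA schema or prompt template.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical marker for the position where the projected protein
# embeddings would be spliced into the LLM input.
PROTEIN_SLOT = "<protein>"


@dataclass
class ProteinRecord:
    """Assumed shape of one annotated protein: tags plus QA pairs."""
    pdb_id: str
    property_tags: Dict[str, str] = field(default_factory=dict)
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)


def alignment_input() -> str:
    """Stage 1: only the projected protein representation, no text prompt."""
    return PROTEIN_SLOT


def instruction_samples(record: ProteinRecord) -> List[Tuple[str, str]]:
    """Stage 2: the question comes first, followed by the protein slot.

    Returns (prompt, target answer) pairs, one per QA pair in the record.
    """
    return [(f"{question}\n{PROTEIN_SLOT}", answer)
            for question, answer in record.qa_pairs]


# Placeholder record; the ID, tag, question, and answer are made up.
record = ProteinRecord(
    pdb_id="0XXX",
    property_tags={"example_tag": "example value"},
    qa_pairs=[("Example question about the protein?", "Example answer.")],
)

samples = instruction_samples(record)
print(repr(alignment_input()))
print(repr(samples[0][0]))
```

Keeping the stage 1 input free of text means the projection layer is the only thing that has to adapt during alignment, while stage 2 exposes the model to explicit questions so it learns to follow instructions.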