ProtChatGPT: Towards Understanding Proteins with Large Language Models

Chao Wang; Hehe Fan; Ruijie Quan; Yi Yang

TL;DR

ProtChatGPT addresses the challenge of making protein knowledge accessible via natural language by bridging multi-level protein representations with large language models. The authors propose a progressive Protein-Language Pretraining (PLP) framework consisting of multi-level encoding, PLP-former alignment, and an instruction-tuning stage with a projection adapter, enabling protein-to-text generation using frozen LLMs. They demonstrate that ProtChatGPT improves protein understanding and design tasks through qualitative conversations, Protein Q&A on PDB-QA, and cross-modal protein-text retrieval on ProteinKG25, with ablation studies confirming the importance of each component. The work promises to democratize protein analysis by enabling interactive, text-based exploration and could seed future protein research tools.

Abstract

Protein research is crucial in various fundamental disciplines, but understanding the intricate structure-function relationships of proteins remains challenging. Recent Large Language Models (LLMs) have made significant strides in comprehending task-specific knowledge, suggesting the potential for ChatGPT-like systems specialized in proteins to facilitate basic research. In this work, we introduce ProtChatGPT, which aims at learning and understanding protein structures via natural language. ProtChatGPT enables users to upload proteins, ask questions, and engage in interactive conversations to produce comprehensive answers. The system comprises protein encoders, a Protein-Language Pretraining Transformer (PLP-former), a projection adapter, and an LLM. The protein is first passed through the protein encoders and the PLP-former to produce protein embeddings, which are then projected by the adapter to conform with the LLM. The LLM finally combines user questions with the projected embeddings to generate informative answers. Experiments show that ProtChatGPT can produce promising responses to proteins and their corresponding questions. We hope that ProtChatGPT could form the basis for further exploration and application in protein research. Code and our pre-trained model will be publicly available.
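The data flow described in the abstract (protein embeddings, then a projection adapter, then an LLM input that also carries the user question) can be illustrated with a small sketch. This is not the authors' implementation: the dimensions, the random weights, and the helper names `make_linear`, `project`, and `build_llm_input` are all hypothetical, and a single linear layer is assumed for the adapter purely for illustration.

```python
import random

def make_linear(d_in, d_out, seed=0):
    """Create a toy linear layer (hypothetical stand-in for the projection adapter)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias

def project(embeddings, weight, bias):
    """Map each protein embedding from the protein space to the LLM embedding space."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in embeddings
    ]

def build_llm_input(protein_embeddings, question_embeddings, weight, bias):
    """Prepend the projected protein prompt to the embedded question tokens."""
    protein_prompt = project(protein_embeddings, weight, bias)
    return protein_prompt + question_embeddings

# Assumed toy sizes: 4 protein tokens of width 8, an LLM width of 16, 5 question tokens.
D_PROTEIN, D_LLM, NUM_PROTEIN_TOKENS, NUM_QUESTION_TOKENS = 8, 16, 4, 5
rng = random.Random(1)
protein_embeddings = [[rng.gauss(0, 1) for _ in range(D_PROTEIN)]
                      for _ in range(NUM_PROTEIN_TOKENS)]
question_embeddings = [[rng.gauss(0, 1) for _ in range(D_LLM)]
                       for _ in range(NUM_QUESTION_TOKENS)]

weight, bias = make_linear(D_PROTEIN, D_LLM)
llm_input = build_llm_input(protein_embeddings, question_embeddings, weight, bias)
print(len(llm_input), len(llm_input[0]))  # 9 16
```

The point of the sketch is only the shape contract: after projection, protein tokens and question tokens share one embedding width, so a frozen LLM can consume them as a single sequence.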

Paper Structure (44 sections, 17 equations, 6 figures, 8 tables)

Introduction
Related Work
Proposed Method
Multi-Level Protein Encoding
Multi-Level Protein-Language Alignment
Protein-Language Pretraining (PLP)
Multi-Level Protein Alignment
Protein Context Gating (PCG)
Contrastive Learning
Instruction Tuning with Protein Features
Projection Adapter
Protein-Text Generation
Experiments
Dataset
Protein-Language Pretraining Dataset
...and 29 more sections

Figures (6)

Figure 1: Overview of the ProtChatGPT framework. Our pipeline consists of three stages: (1) multi-level protein encoding, (2) multi-level protein-language alignment, and (3) instruction tuning with an external web corpus and protein features. In the first stage, we utilize three pre-trained frozen large protein encoders to acquire high-quality multi-level embeddings. In the second stage, we first train the PLP-former, a lightweight transformer with learnable query tokens, to learn the protein representation most relevant to the text description. The PLP-former takes the sequence embedding ${\bm{E}}_{seq}$, the query tokens, and protein descriptions as inputs and outputs the learned tokens as the selected embedding. The selected embedding is first aligned with the secondary structure embedding ${\bm{E}}_{sec}$ through protein context gating to produce the aligned embedding ${\bm{E}}_{align}$, and then we adopt contrastive learning to enforce the tertiary structure embedding ${\bm{E}}_{ter}$ to further align with the joint representation ${\bm{E}}_{align}$. In the third stage, we perform protein-to-text generative learning by connecting the aligned ${\bm{E}}_{align}$ and ${\bm{E}}_{str}$ to an LLM decoder. Using the instruction pairs extracted from open protein-related literature, an adapter is further trained as the information bottleneck between protein embeddings and the LLM, such that its output can be interpreted by the language model. Finally, the LLM can produce descriptive answers given the question prompt and the multi-level protein prompt from the adapter.
Figure 2: Illustration of the PLP-former and protein-language representation learning. PLP-former consists of two transformer submodules with shared self-attention: (1) a text transformer that performs encoding and decoding of protein descriptions, and (2) a protein transformer that interacts with the frozen ESM-1b for sequence feature extraction. PLP-former is trained by jointly optimizing three pretraining objectives (dashed boxes) on sequence-description pairs.
Figure 3: Several conversation examples. (a, b): ProtChatGPT can fully explore the intrinsic properties of proteins and accurately understand user queries, enabling protein understanding and analysis. (c): ProtChatGPT also has the potential to assist in drug development through pathogenicity analysis, diagnostic simulation, and protein design.
Figure 4: Challenging case studies. Benefitting from contrastive-based feature alignment and domain-specific instruction tuning, our proposed ProtChatGPT is capable of handling complex scenarios such as (a, b) homologous proteins and (c) mutually exclusive functions.
Figure 5: Comparison of fine-tuning of PLP-former and LLM decoders during the instruction tuning stage. We compute the SPICE and PubMed BERTScore for semantic evaluation.
...and 1 more figure
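The Figure 1 caption mentions contrastive learning that pulls the tertiary structure embedding toward the aligned representation. The exact loss is not given on this page, so the following is only a sketch under the assumption of a standard InfoNCE-style objective: each anchor embedding should be most similar to its own paired embedding among all candidates in the batch. The function names `l2_normalize` and `info_nce`, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchor i should match positive i, not the other positives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    losses = []
    for i, anchor in enumerate(a):
        # Cosine similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Numerically stable log-sum-exp over the candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching pair (index i) as the target.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy stand-ins for aligned embeddings and tertiary structure embeddings.
aligned = [[1.0, 0.0], [0.0, 1.0]]
tertiary_matched = [[1.0, 0.0], [0.0, 1.0]]
tertiary_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce(aligned, tertiary_matched)
loss_swapped = info_nce(aligned, tertiary_swapped)
print(loss_matched < loss_swapped)  # True
```

Correctly paired embeddings give a loss near zero, while swapped pairs give a large loss, which is the signal that drives the two representations toward each other during training.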
