InstructProtein: Aligning Human and Protein Language via Knowledge Instruction

Zeyuan Wang, Qiang Zhang, Keyan Ding, Ming Qin, Xiang Zhuang, Xiaotong Li, Huajun Chen

TL;DR

InstructProtein is a first-of-its-kind LLM capable of bidirectional generation between human and protein languages: it is jointly pretrained on protein sequences and natural language, and the two are then aligned via knowledge graph (KG)-informed instruction tuning. A knowledge graph-based instruction generation framework, featuring knowledge causal modeling and debiased sampling, yields a high-quality instruction dataset that improves zero-shot protein understanding and de novo design. Empirical results show that InstructProtein outperforms state-of-the-art open and domain-specific LLMs on protein localization, function annotation, and metal binding tasks, as well as on instruction-following protein design, including structure-guided and ligand-binding design. This work bridges protein and human language understanding, enabling text-guided protein function prediction and sequence design with potential for scalable, instruction-driven biological discovery.

Abstract

Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins. To address this challenge, we propose InstructProtein, an innovative LLM that possesses bidirectional generation capabilities in both human and protein languages: (i) taking a protein sequence as input to predict its textual function description and (ii) using natural language to prompt protein sequence generation. To achieve this, we first pre-train an LLM on both protein and natural language corpora, enabling it to comprehend each language individually. Supervised instruction tuning is then employed to align these two distinct languages. To this end, we introduce a knowledge graph-based instruction generation framework to construct a high-quality instruction dataset, addressing the annotation imbalance and instruction deficits in existing protein-text corpora. In particular, the instructions inherit the structural relations between proteins and function annotations in knowledge graphs, which empowers our model to engage in the causal modeling of protein functions, akin to chain-of-thought processes in natural language. Extensive experiments on bidirectional protein-text generation tasks show that InstructProtein outperforms state-of-the-art LLMs by large margins. Moreover, InstructProtein serves as a pioneering step towards text-based protein function prediction and sequence design, effectively bridging the gap between protein and human language understanding.
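To make the two directions (i) and (ii) concrete, the following is a minimal sketch of how a single knowledge-graph triple could be turned into one instruction per direction. It is an illustration under stated assumptions, not the authors' implementation: the `Triple` class, the function names, the prompt wording, and the toy sequence are all hypothetical, and the paper uses an LLM together with KG completion tasks rather than fixed templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for a KG triple linking a protein to an annotation."""
    sequence: str    # amino-acid sequence (head entity); toy value below
    relation: str    # e.g. "subcellular location"
    annotation: str  # textual annotation (tail entity)


def protein_to_text(t: Triple) -> dict:
    """Direction (i): protein sequence in, textual annotation out."""
    return {
        "instruction": f"What is the {t.relation} of the following protein?",
        "input": t.sequence,
        "output": t.annotation,
    }


def text_to_protein(t: Triple) -> dict:
    """Direction (ii): natural-language description in, protein sequence out."""
    return {
        "instruction": f"Design a protein whose {t.relation} is: {t.annotation}",
        "input": "",
        "output": t.sequence,
    }


def build_pairs(t: Triple) -> list:
    """Return both instruction records derived from one triple."""
    return [protein_to_text(t), text_to_protein(t)]


if __name__ == "__main__":
    # Made-up sequence fragment, used only to show the record layout.
    example = Triple("MKTAYIAKQR", "subcellular location", "cytoplasm")
    for record in build_pairs(example):
        print(record)
```

Using fixed templates keeps the sketch short; it does not reproduce the diversity or the causal chaining of annotations that the abstract attributes to the KG-based framework.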

Paper Structure (24 sections, 5 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: An example of bidirectional generation by LLMs between human and protein languages. ChatGPT fails to provide an accurate response while the proposed InstructProtein offers a reasonable solution.
  • Figure 2: The top-5 subcellular location categories and their respective proportions, compared with the least frequently used annotations, which account for only 0.000224%.
  • Figure 3: Overview of instruction generation methods. The red text represents the fields that rely on the internal knowledge of LLMs. (a) Given a set of seed tasks, an LLM is prompted to produce new instruction data. (b) LLMs are used to generate instruction data corresponding to the contents of raw documents. (c) The proposed knowledge graph (KG)-based instruction generation framework. We first construct a KG with knowledge causal modeling (KCM) and introduce a debiased sampler to pick informative triples, which are then translated into instruction data by LLMs in conjunction with KG completion tasks.
  • Figure 4: An example of converting a KG triple to instructions. Given a triple with KCM, we use an LLM in conjunction with KG completion tasks to generate factual, logical, and diverse instructions.
  • Figure 5: Visualization of structure instruction-based protein sequence de novo design. We prompt our models with different scales (125m, 350m and 1.3b) to generate three kinds of proteins (all $\alpha$-helix, all $\beta$-sheet, and a combination of $\alpha$-helix and $\beta$-sheet), respectively. (a) We visualize the pLDDT of generated sequences predicted by AlphaFold2 to assess the protein foldability. (b) The embeddings of sequences prompted with all $\alpha$-helix and all $\beta$-sheet instructions, which are extracted from ESM2 and visualized by the MDS algorithm. (c) The structure of generated proteins with the highest confidence in each class.
  • ...and 7 more figures
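The Figure 3 caption mentions a debiased sampler that picks informative triples, and Figure 2 shows how skewed annotation frequencies can be. The sketch below illustrates one generic way to counter such skew, inverse-frequency weighting, so that every annotation class is equally likely to be drawn. This is a swapped-in, assumed technique for illustration; the source text here does not specify how the paper's sampler works, and the function names and toy triples are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(triples):
    """Weight each (protein, relation, annotation) triple by 1 / annotation count.

    With these weights the total mass of every annotation class is 1,
    so rare and frequent annotations are drawn with equal probability.
    """
    counts = Counter(annotation for _, _, annotation in triples)
    return [1.0 / counts[annotation] for _, _, annotation in triples]


def sample_debiased(triples, k, seed=0):
    """Draw k triples with replacement using inverse-frequency weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(triples)
    return rng.choices(triples, weights=weights, k=k)


if __name__ == "__main__":
    # Toy data: nine proteins annotated "cytoplasm", one annotated "peroxisome".
    toy = [(f"P{i}", "subcellular location", "cytoplasm") for i in range(9)]
    toy.append(("P9", "subcellular location", "peroxisome"))
    drawn = sample_debiased(toy, k=1000)
    print(Counter(annotation for _, _, annotation in drawn))
```

Uniform sampling over the toy data would return the rare annotation about 10% of the time; with the weights above, both annotations are drawn at roughly equal rates.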