Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding with LLMs

Wei Wu, Chao Wang, Liyi Chen, Mingze Yin, Yiheng Zhu, Kun Fu, Jieping Ye, Hui Xiong, Zheng Wang

TL;DR

SEPIT targets general-purpose protein understanding by coupling structure-aware protein language models with large language models through a sequence/structure fused encoder, a linear projector, and a mixture-of-experts LLM. It uses a three-stage training pipeline (structure-informed warm-up, protein-caption pre-training, and MoEs-based instruction tuning) and is trained on the largest protein instruction dataset to date. Empirical results show that SEPIT consistently outperforms zero-shot and task-tuned baselines on open-ended generation and closed-set questions, and ablations confirm the necessity of structure, MoEs, and high-quality data. By enabling robust, generalizable reasoning over protein properties and functions, the work advances practical protein understanding, with potential impact on biology and drug discovery.

Abstract

Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurate prediction of their properties and functions is crucial in biological applications. The recent development of protein language models (pLMs) with supervised fine-tuning provides a promising solution to this problem. However, a fine-tuned model is tailored to a particular downstream prediction task, and achieving general-purpose protein understanding remains a challenge. In this paper, we introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach incorporates a novel structure-aware module into pLMs to enrich their structural knowledge, and subsequently integrates these enhanced pLMs with large language models (LLMs) to advance protein understanding. Within this framework, we propose a novel instruction tuning pipeline. First, we warm up the enhanced pLMs using contrastive learning and structure denoising. Then, caption-based instructions are used to establish a basic understanding of proteins. Finally, we refine this understanding by employing a mixture of experts (MoEs) to capture more complex properties and functional information with the same number of activated parameters. Moreover, we construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate the general-purpose protein understanding model. Extensive experiments on both open-ended generation and closed-set answer tasks demonstrate the superior performance of SEPIT over both closed-source general LLMs and open-source LLMs trained with protein knowledge.
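
The phrase "the same number of activated parameters" can be illustrated with a toy sparse mixture-of-experts layer. The sketch below is not the SEPIT implementation: the softmax router, top-1 routing, layer sizes, and all function names are assumptions chosen for illustration. It only shows that adding experts increases total parameters while the parameters used per token stay equal to those of a single feed-forward expert.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_expert(rng, dim, hidden):
    """A toy two-layer feed-forward expert: dim -> hidden -> dim (assumed shape)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]
    return w_in, w_out


def expert_forward(expert, x):
    w_in, w_out = expert
    h = [max(0.0, v) for v in matvec(w_in, x)]  # ReLU activation
    return matvec(w_out, h)


def moe_forward(x, router, experts, top_k=1):
    """Route one token to its top_k experts; return the output and chosen indices."""
    probs = softmax(matvec(router, x))
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts
    out = [0.0] * len(x)
    for i in chosen:
        y = expert_forward(experts[i], x)
        weight = probs[i] / norm
        out = [o + weight * v for o, v in zip(out, y)]
    return out, chosen


def count_params(expert):
    """Number of weights in one expert (both matrices)."""
    return sum(len(row) for matrix in expert for row in matrix)


# Hypothetical sizes, far smaller than any real LLM layer.
DIM, HIDDEN, NUM_EXPERTS, TOP_K = 8, 16, 4, 1
rng = random.Random(0)
experts = [make_expert(rng, DIM, HIDDEN) for _ in range(NUM_EXPERTS)]
router = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(DIM)]

output, chosen = moe_forward(token, router, experts, top_k=TOP_K)
total_params = sum(count_params(e) for e in experts)
active_params = sum(count_params(experts[i]) for i in chosen)
print(f"total expert parameters: {total_params}, activated per token: {active_params}")
```

With top-1 routing, each token passes through exactly one expert, so the per-token compute matches a dense layer of the same expert size even though the layer stores four experts' worth of weights.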

Paper Structure

This paper contains 46 sections, 27 equations, 7 figures, and 15 tables.

Figures (7)

  • Figure 1: (a) The model architecture of the SEPIT framework, which includes a sequence/structure fused protein encoder, a linear projector, and LLMs with MoEs modules; (b) an example of the instruction format.
  • Figure 2: The three-stage training pipeline of SEPIT with a warm-up stage (stage 0) for protein encoder and two instruction tuning stages (stage 1 & stage 2).
  • Figure 3: The workload of experts in SEPIT (left) and tokens' pathways among experts (right).
  • Figure 4: Distribution of protein sequence length and conversation length in the protein instruction dataset (training set).
  • Figure 5: Workload of experts in SEPIT for protein tokens and text tokens.
  • ...and 2 more figures