ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing

Liuzhenghao Lv, Zongying Lin, Hao Li, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, Li Yuan, Yonghong Tian

TL;DR

ProLLaMA is a multitask protein language model, enhanced by the Evolutionary Protein Generation Framework (EPGF), that unifies Protein Language Understanding and Protein Language Generation. It is trained in two LoRA-based stages: first on a protein language dataset of 53 million samples, then on an instruction dataset of about 13 million samples. At inference, EPGF combines a biophysical scorer, segment-level decoding, and probabilistic-biophysical joint selection with adaptive diversity control to improve the biological viability of generated sequences. The reported results show better structural quality than existing PLMs in unconditional and controllable protein generation, a 67.1% exact match rate in superfamily prediction, and gains of +4.3% in biophysical scores and +14.5% in structural metrics when EPGF is used. The work offers a path toward biologically grounded protein design for protein engineering and discovery.

Abstract

Recent advances in Protein Language Models (PLMs) have transformed protein engineering, yet unlike their counterparts in Natural Language Processing (NLP), current PLMs exhibit a fundamental limitation: they excel in either Protein Language Understanding (PLU) or Protein Language Generation (PLG), but rarely both. This fragmentation hinders progress in protein engineering. To bridge this gap, we introduce ProLLaMA, a multitask protein language model enhanced by the Evolutionary Protein Generation Framework (EPGF). We construct a comprehensive instruction dataset containing approximately 13 million samples with over 11,000 superfamily annotations to facilitate better modeling of sequence-function landscapes. We leverage a two-stage training approach to develop ProLLaMA, a multitask LLM with protein domain expertise. Our EPGF addresses the mismatch between statistical language modeling and biological constraints through three innovations: a multi-dimensional interpretable scorer, hierarchical efficient decoding, and a probabilistic-biophysical joint selection mechanism. Extensive experiments demonstrate that ProLLaMA excels in both unconditional and controllable protein generation tasks, achieving superior structural quality metrics compared to existing PLMs. Additionally, ProLLaMA demonstrates strong understanding capabilities with a 67.1% exact match rate in superfamily prediction. EPGF significantly enhances the biological viability of generated sequences, as evidenced by improved biophysical scores (+4.3%) and structural metrics (+14.5%). The project is available at https://github.com/PKU-YuanGroup/ProLLaMA.
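
To make the three EPGF components named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It generates a sequence segment by segment, scores each candidate with a mix of model likelihood and a biophysical score, and keeps the best candidate for the next round (as the Figure 4 caption describes). Everything else is an assumption made for illustration: `sample_segment` is a hypothetical stand-in for the language model (it ignores the prefix and samples from fixed toy weights), `bio_score` is a toy two-term scorer, and the mixing weight `alpha` and the temperature rule used for diversity control are invented placeholders.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AILMFVWY")
# Toy, non-uniform residue preferences standing in for a real model's distribution.
BASE_WEIGHTS = [1.0 + (i % 3) for i in range(len(AMINO_ACIDS))]


def sample_segment(rng, length, temperature):
    """Hypothetical stand-in for the language model: sample one segment and
    return it with its mean per-residue log-probability."""
    weights = [w ** (1.0 / temperature) for w in BASE_WEIGHTS]
    total = sum(weights)
    idx = rng.choices(range(len(AMINO_ACIDS)), weights=weights, k=length)
    segment = "".join(AMINO_ACIDS[i] for i in idx)
    mean_logp = sum(math.log(weights[i] / total) for i in idx) / length
    return segment, mean_logp


def bio_score(seq):
    """Toy scorer in [0, 1]: hydrophobic fraction near 0.4 plus residue variety.
    The dimensions of the paper's scorer are not given in this section."""
    hydro = sum(aa in HYDROPHOBIC for aa in seq) / len(seq)
    hydro_term = 1.0 - min(1.0, abs(hydro - 0.4) / 0.4)
    top_fraction = max(seq.count(aa) for aa in set(seq)) / len(seq)
    return 0.5 * (hydro_term + (1.0 - top_fraction))


def generate(total_len=48, seg_len=8, n_candidates=6, alpha=0.5, seed=0):
    """Segment-level generation with joint probabilistic-biophysical selection."""
    rng = random.Random(seed)
    seq, temperature, prev_best = "", 1.0, None
    while len(seq) < total_len:
        step = min(seg_len, total_len - len(seq))
        best_score, best_seg = -1.0, ""
        for _ in range(n_candidates):
            seg, mean_logp = sample_segment(rng, step, temperature)
            # Joint score: per-residue likelihood mixed with the biophysical
            # score of the sequence extended by this candidate segment.
            joint = alpha * math.exp(mean_logp) + (1 - alpha) * bio_score(seq + seg)
            if joint > best_score:
                best_score, best_seg = joint, seg
        seq += best_seg  # the selected candidate seeds the next round
        # Assumed diversity rule: sample more broadly after a round whose best
        # score dropped, and return toward temperature 1.0 otherwise.
        if prev_best is not None and best_score < prev_best:
            temperature = min(2.0, temperature * 1.25)
        else:
            temperature = max(1.0, temperature / 1.25)
        prev_best = best_score
    return seq


print(generate())
```

Scoring whole segments rather than single residues keeps the number of scorer calls at `n_candidates` per segment, which is one plausible reading of why the decoding is described as efficient.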

Paper Structure

This paper contains 15 sections, 6 equations, 7 figures, 4 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Left: LLMs can handle both generation and understanding tasks, whereas PLMs cannot. This highlights the disparity in capabilities between the two. Right: Our ProLLaMA can handle generation tasks (unconditional protein generation, controllable protein generation) and understanding tasks (protein superfamily prediction), surpassing current PLMs.
  • Figure 2: (A) Overview of the dataset construction. The protein language dataset, used for training in Stage 1, contains 53 million samples. The instruction dataset, used for training in Stage 2, contains 13 million instances with 11,268 unique superfamily annotations. (B) Overview of the training framework. Stage 1: The pre-trained LLaMA-2 learns the protein language, resulting in ProLLaMA. Stage 2: ProLLaMA learns to perform multiple tasks by instruction tuning.
  • Figure 3: The overview of the ProLLaMA model. We add LoRA adapters to certain weights. We freeze original parameters, focusing solely on training LoRA adapters (Embed and Head are also involved in the first training stage).
  • Figure 4: The overview of EPGF. EPGF has three key components: (1) a multi-dimensional biophysical scorer; (2) a hierarchical efficient decoding strategy which generates protein candidates at segment-level; (3) probabilistic-biophysical joint selection with adaptive diversity control, which selects the superior candidate for the next round of generation.
  • Figure 5: ProLLaMA generates better protein sequences with EPGF. We visualize the (a) pLDDT (b) BioScore (c) TM-Score values of proteins belonging to four superfamilies (in order: SAM-MT, TPHD, Trx, CheY). $w/$: ProLLaMA with EPGF; $w/o$: ProLLaMA alone; $Natural$: Natural proteins as reference; $\mu$: the average value; $BioScore$: the biophysical score calculated by our scorer. EPGF improves the performance of ProLLaMA, and the generated proteins approach or even surpass natural proteins on pLDDT.
  • ...and 2 more figures
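
The adapter setup in the Figure 3 caption (original parameters frozen, only LoRA adapters trained) can be illustrated with a small pure-Python sketch. It is a generic LoRA layer under stated assumptions, not the ProLLaMA code: the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the 16x16 layer size are all hypothetical, and the Embed and Head training of the first stage is not modeled.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: a training step would never touch it
        self.scale = alpha / rank
        # Usual LoRA initialisation: A is small and random, B is zero,
        # so the adapter contributes nothing before training starts.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        """Only A and B would receive gradients."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(0)
W = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(16)]
layer = LoRALinear(W, rank=2, alpha=4.0, rng=rng)
x = [1.0] * 16

# With B still zero, the adapted layer reproduces the frozen layer exactly.
assert layer.forward(x) == matvec(W, x)
print(layer.trainable_params(), "trainable vs", 16 * 16, "frozen parameters")
```

Because only the two low-rank factors are trained, the trainable parameter count grows with the rank rather than with the full size of each adapted weight matrix.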