EvoLlama: Enhancing LLMs' Understanding of Proteins via Multimodal Structure and Sequence Representations

Nuowei Liu, Changzhi Sun, Tao Ji, Junfeng Tian, Jianxin Tang, Yuanbin Wu, Man Lan

TL;DR

EvoLlama addresses the gap where LLMs treat protein sequences as plain text by integrating structure- and sequence-based encoders with a large language model. The approach fuses modality-specific representations through a lightweight projection layer and trains in two stages to enable zero-shot instruction following and improved protein property prediction, using data from Swiss-Prot, PEER, and Mol-Instructions. Empirical results show zero-shot improvements of 1–8% over baselines and about 6% gains with supervised fine-tuning, with competitive performance on PEER protein property tasks. This work demonstrates that bridging multimodal protein representations with an LLM can enhance protein understanding while maintaining efficiency and scalability for downstream biological tasks.

Abstract

Current Large Language Models (LLMs) for understanding proteins primarily treat amino acid sequences as a text modality. Meanwhile, Protein Language Models (PLMs), such as ESM-2, have learned massive sequential evolutionary knowledge from the universe of natural protein sequences. Furthermore, structure-based encoders like ProteinMPNN learn the structural information of proteins through Graph Neural Networks. However, whether the incorporation of protein encoders can enhance the protein understanding of LLMs has not been explored. To bridge this gap, we propose EvoLlama, a multimodal framework that connects a structure-based encoder, a sequence-based protein encoder and an LLM for protein understanding. EvoLlama consists of a ProteinMPNN structure encoder, an ESM-2 protein sequence encoder, a multimodal projector to align protein and text representations and a Llama-3 text decoder. To train EvoLlama, we fine-tune it on protein-oriented instructions and protein property prediction datasets verbalized via natural language instruction templates. Our experiments show that EvoLlama's protein understanding capabilities have been significantly enhanced, outperforming other fine-tuned protein-oriented LLMs in zero-shot settings by an average of 1%-8% and surpassing the state-of-the-art baseline with supervised fine-tuning by an average of 6%. On protein property prediction datasets, our approach achieves promising results that are competitive with state-of-the-art task-specific baselines. We will release our code in a future version.
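
The abstract describes a multimodal projector that aligns the outputs of the structure and sequence encoders with the text representation space. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each encoder emits one feature vector per residue, that each modality gets its own linear projection, and that the two projected sequences are fused by element-wise addition. The fusion rule, the toy dimensions, and all function names (`make_matrix`, `project`, `fuse`) are assumptions made for illustration; random matrices stand in for the real ProteinMPNN and ESM-2 outputs and for trained weights.

```python
import random


def make_matrix(rows, cols, rng):
    """Random matrix standing in for encoder outputs or trained projection weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Apply a linear map to each per-residue vector: (L x d_in) @ (d_in x d_out)."""
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in zip(*weight)]
        for vec in features
    ]


def fuse(struct_feats, seq_feats, w_struct, w_seq):
    """Project both modalities into a shared space and add them element-wise.

    Element-wise addition is an assumption of this sketch; the source only
    says the representations are fused through a projection layer.
    """
    if len(struct_feats) != len(seq_feats):
        raise ValueError("both encoders must emit one vector per residue")
    projected_struct = project(struct_feats, w_struct)
    projected_seq = project(seq_feats, w_seq)
    return [
        [a + b for a, b in zip(u, v)]
        for u, v in zip(projected_struct, projected_seq)
    ]


rng = random.Random(0)
# Toy sizes chosen for readability, not the real model dimensions.
num_residues, d_struct, d_seq, d_llm = 5, 4, 6, 8
struct_feats = make_matrix(num_residues, d_struct, rng)  # stand-in for structure encoder output
seq_feats = make_matrix(num_residues, d_seq, rng)        # stand-in for sequence encoder output
w_struct = make_matrix(d_struct, d_llm, rng)
w_seq = make_matrix(d_seq, d_llm, rng)

protein_embeds = fuse(struct_feats, seq_feats, w_struct, w_seq)
print(len(protein_embeds), len(protein_embeds[0]))  # 5 8
```

In this sketch the fused output has one vector per residue in the decoder's embedding width, so it could be placed alongside text token embeddings. A real system would learn `w_struct` and `w_seq` during training instead of sampling them.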

Paper Structure

This paper contains 45 sections, 1 equation, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Overall architecture and training pipeline of EvoLlama.
  • Figure 2: An example of the projection tuning data and supervised fine-tuning data. Note that the special token <protein> denotes the fused protein representations of structural and sequential features.
  • Figure 3: Overview of the projection tuning data construction.
  • Figure 4: The prompt and response template of the projection tuning data. In the response template, similarity refers to the families of the protein in Swiss-Prot.
  • Figure 5: Overview of the supervised fine-tuning data construction.
  • ...and 6 more figures
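
The Figure 2 caption states that the special token <protein> denotes the fused protein representations, and the abstract says property prediction datasets are verbalized via natural language instruction templates. The sketch below illustrates how a labeled record might be turned into a prompt and response pair around that token. The template wording, the yes/no question, and the helper names (`verbalize`, `split_on_protein`) are hypothetical; the paper's actual templates are not reproduced here.

```python
PROTEIN_TOKEN = "<protein>"  # special token named in the Figure 2 caption

# Hypothetical template wording, used only for illustration.
PROMPT_TEMPLATE = "{protein}\nIs this protein soluble? Answer yes or no."


def verbalize(label: bool) -> dict:
    """Turn a binary property label into a prompt/response pair."""
    return {
        "prompt": PROMPT_TEMPLATE.format(protein=PROTEIN_TOKEN),
        "response": "yes" if label else "no",
    }


def split_on_protein(prompt: str) -> tuple[str, str]:
    """Split the prompt around the placeholder so embeddings can be spliced in between."""
    before, sep, after = prompt.partition(PROTEIN_TOKEN)
    if not sep:
        raise ValueError("prompt has no <protein> placeholder")
    return before, after


example = verbalize(True)
before, after = split_on_protein(example["prompt"])
print(repr(before), repr(after), example["response"])
```

Splitting the prompt at the placeholder reflects one common design: the text on either side is tokenized normally, and the fused protein vectors take the position the token occupied.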