ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction

Mingyu Jin, Haochen Xue, Zhenting Wang, Boming Kang, Ruosong Ye, Kaixiong Zhou, Mengnan Du, Yongfeng Zhang

TL;DR

ProLLM reframes protein-protein interaction prediction as a natural language reasoning task by encoding signaling pathways into Protein Chain of Thought (ProCoT) prompts for LLMs. It couples ProCoT with ProtTrans-based embedding replacement and instruction fine-tuning on a protein-knowledge corpus (Mol), enabling multi-step, pathway-aware inferences that capture non-physical, signaling-chain connections. Across four benchmark datasets (Human, SHS27K, SHS148K, STRING), ProLLM outperforms traditional ML, GNN-based, and prior LLM-based approaches in micro-F1 and demonstrates strong generalization. This work suggests a promising direction for integrating structured biological data with large language models to advance protein interaction discovery and functional network analyses, with potential implications for drug discovery and systems biology.

Abstract

The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases. Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions, ignoring the broader context of nonphysical connections through intermediate proteins, thus limiting their effectiveness. The emergence of Large Language Models (LLMs) provides a new opportunity for addressing this complex biological challenge. By transforming structured data into natural language prompts, we can map the relationships between proteins into texts. This approach allows LLMs to identify indirect connections between proteins, tracing the path from upstream to downstream. Therefore, we propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time. Specifically, we propose Protein Chain of Thought (ProCoT), which replicates the biological mechanism of signaling pathways as natural language prompts. ProCoT considers a signaling pathway as a protein reasoning process, which starts from upstream proteins and passes through several intermediate proteins to transmit biological signals to downstream proteins. Thus, we can use ProCoT to predict the interaction between upstream proteins and downstream proteins. The training of ProLLM employs the ProCoT format, which enhances the model's understanding of complex biological problems. In addition to ProCoT, this paper also contributes to the exploration of embedding replacement of protein sites in natural language prompts, and instruction fine-tuning in protein knowledge datasets. We demonstrate the efficacy of ProLLM through rigorous validation against benchmark datasets, showing significant improvement over existing methods in terms of prediction accuracy and generalizability. The code is available at: https://github.com/MingyuJ666/ProLLM.
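As an illustration of how a signaling pathway might be rendered as a ProCoT-style prompt, the following is a minimal sketch, not the authors' implementation. The function name `build_procot_prompt`, the `(source, relation, target)` step format, the prompt wording, and the example protein and relation labels are all hypothetical; the actual format is defined in the linked repository.

```python
from typing import List, Tuple

# Hypothetical representation: one pathway step is (source protein, relation, target protein).
Step = Tuple[str, str, str]


def build_procot_prompt(pathway: List[Step]) -> str:
    """Render a chain of protein relations as a natural-language prompt.

    The prompt lists each step from upstream to downstream and ends with a
    question about the (masked) interaction between the head and tail proteins.
    The wording is an assumption made for illustration only.
    """
    if not pathway:
        raise ValueError("pathway must contain at least one step")

    # Each step must start where the previous one ended, so the chain is connected.
    for (_, _, prev_target), (next_source, _, _) in zip(pathway, pathway[1:]):
        if prev_target != next_source:
            raise ValueError(
                f"pathway is not a chain: {prev_target!r} does not lead to {next_source!r}"
            )

    head = pathway[0][0]   # upstream protein
    tail = pathway[-1][2]  # downstream protein

    lines = [
        f"Step {i}: the relation between {src} and {tgt} is {rel}."
        for i, (src, rel, tgt) in enumerate(pathway, start=1)
    ]
    lines.append(f"Question: what is the interaction between {head} and {tail}?")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder protein names and relation labels, not taken from any dataset.
    example = [
        ("ProteinA", "activation", "ProteinB"),
        ("ProteinB", "binding", "ProteinC"),
    ]
    print(build_procot_prompt(example))
```

In this sketch the intermediate steps act as the reasoning chain, and the final question corresponds to the masked head-to-tail interaction that the model is asked to predict.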

Paper Structure (32 sections, 8 figures, 5 tables)

This paper contains 32 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of the ProLLM framework. We fine-tune ProLLM on the Human, SHS27K, SHS148K, and STRING datasets, enabling it to solve various PPI-related tasks with structural information described purely in natural language.
  • Figure 2: The difference between existing methods and our method in PPI prediction. Existing methods focus on the properties of the upstream and downstream proteins, whereas our method focuses on signaling-pathway-like connections.
  • Figure 3: The process of ProLLM. Sector 1: Transform the original protein data into the natural language ProCoT format, which describes the signaling pathways between proteins; Sector 2: Replace the natural language embeddings at protein sites with protein embeddings to enhance the model's understanding of proteins; Sector 3: Inject knowledge about protein function; Sector 4: Fine-tune on the ProCoT-format dataset from Sector 1.
  • Figure 4: The fine-tuning process of ProCoT. Within the first dashed box, solid lines between proteins represent the signaling pathway, and the dashed lines connecting the head and tail proteins indicate the masked interaction. The model predicts the type of the masked interaction.
  • Figure 5: Demonstration of the BFS and DFS dataset partition methods.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4