GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration

Ziwen Li, Xiang 'Anthony' Chen, Youngseung Jeon

TL;DR

GraPPI is a retrieve-divide-solve agent pipeline RAG framework built on a large-scale knowledge graph (KG). It supports large-scale PPI signaling pathway exploration for understanding therapeutic impacts by decomposing the analysis of entire PPI pathways into sub-tasks focused on the analysis of PPI edges.

Abstract

Drug discovery (DD) has contributed tremendously to maintaining and improving public health. Hypothesizing that inhibiting protein misfolding can slow disease progression, researchers focus on target identification (Target ID) to find protein structures for drug binding. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug discovery, integrating these models into cohesive workflows remains challenging. We conducted a user study with drug discovery researchers to assess the applicability of LLMs and RAG to Target ID. The study produced two main findings: 1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on an initial protein and on protein candidates that have a therapeutic impact; 2) the model must provide each PPI together with relevant explanations for better understanding. Based on these observations, we identified three limitations in previous approaches to Target ID: 1) semantic ambiguity, 2) lack of explainability, and 3) short retrieval units. To address these issues, we propose GraPPI, a retrieve-divide-solve agent pipeline RAG framework built on a large-scale knowledge graph (KG). GraPPI supports large-scale PPI signaling pathway exploration for understanding therapeutic impacts by decomposing the analysis of entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
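The retrieve-divide-solve idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy graph, the protein names, and the functions `retrieve_paths`, `divide_into_edges`, `explain_edge`, and `solve_path` are all hypothetical, and `explain_edge` is a placeholder for whatever LLM call produces an edge explanation. The sketch only shows the general shape of enumerating pathways from an initial protein, splitting each pathway into edges, and handling each edge as its own sub-task.

```python
from collections import deque

# Hypothetical toy PPI graph: protein -> list of interacting partners.
# The protein names are placeholders, not data from the paper.
TOY_PPI_GRAPH = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P4"],
    "P4": [],
}


def retrieve_paths(graph, start, max_hops):
    """Retrieve: list simple paths with 1..max_hops edges that begin at `start`."""
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)
        if len(path) - 1 < max_hops:
            for neighbor in graph.get(path[-1], []):
                if neighbor not in path:  # keep paths simple (no revisits)
                    queue.append(path + [neighbor])
    return paths


def divide_into_edges(path):
    """Divide: split one pathway into its consecutive PPI edges."""
    return list(zip(path, path[1:]))


def explain_edge(edge, impact):
    """Solve (placeholder): stands in for an LLM call that explains one edge."""
    source, target = edge
    return f"{source} -> {target}: placeholder explanation for '{impact}'"


def solve_path(path, impact, cache):
    """Explain a pathway edge by edge, reusing explanations of shared edges."""
    explanations = []
    for edge in divide_into_edges(path):
        if edge not in cache:
            cache[edge] = explain_edge(edge, impact)
        explanations.append(cache[edge])
    return explanations


if __name__ == "__main__":
    edge_cache = {}
    for pathway in retrieve_paths(TOY_PPI_GRAPH, "P1", max_hops=2):
        print(pathway, solve_path(pathway, "inhibition", edge_cache))
```

Caching by edge is a design choice of this sketch: when several pathways share an edge, the edge is explained once and the result is reused, which is one plausible benefit of working at the edge level rather than re-analyzing whole pathways.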

Paper Structure

This paper contains 28 sections, 1 equation, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: GraPPI for target identification (Target ID). Based on two inputs, an initial protein and a therapeutic impact on that protein, GraPPI recommends PPI pathways with explanations and retrieves text based on previous work.
  • Figure 2: Overview of the GraPPI framework. Users give GraPPI the name of the initial protein and a therapeutic impact query. The outputs are recommended PPIs with AI-generated explanations and retrieved information from the database.
  • Figure 3: Box plots of results under different graph sizes. The blue background shading indicates the level of difference between the two groups; darker blue represents a more significant difference. In plots (a), (b), and (c), the red and orange boxes represent the accuracies of GraPPI and of the system that directly uses protein annotations, respectively. In plot (d), the green and purple boxes indicate the number of input tokens used by each method. "Raw" refers to methods that use raw annotation texts as context, while "Ours" uses edge explanations, which give a more concise representation of the biomedical context.
  • Figure 4: Part of the results of the case study showing the input and output content regarding certain recommended PPI signaling pathways.
  • Figure 5: Prompt templates for edge explanations and path explanations.