GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning

Dan Kalifa, Uriel Singer, Kira Radinsky

TL;DR

GOProteinGNN addresses a limitation of sequence-only protein representations by integrating a protein knowledge graph (KG) into protein language models. It introduces a Graph Neural Networks Knowledge Injection (GKI) layer that uses the [CLS] token to propagate graph-derived knowledge into the amino acid sequence representations, and it learns the entire KG during pre-training rather than individual triplets. The authors report state-of-the-art performance on contact prediction, semantic similarity, protein-protein interaction (PPI) identification, and remote homology detection. Because it models protein context and interactions more accurately, the framework may also benefit drug discovery and virtual screening.

Abstract

Proteins play a vital role in biological processes and are indispensable for living organisms. Accurate representation of proteins is crucial, especially in drug development. Recently, there has been a notable increase in interest in utilizing machine learning and deep learning techniques for unsupervised learning of protein representations. However, these approaches often focus solely on the amino acid sequence of proteins and lack factual knowledge about proteins and their interactions, thus limiting their performance. In this study, we present GOProteinGNN, a novel architecture that enhances protein language models by integrating protein knowledge graph information during the creation of amino acid level representations. Our approach allows for the integration of information at both the individual amino acid level and the entire protein level, enabling a comprehensive and effective learning process through graph-based learning. By doing so, we can capture complex relationships and dependencies between proteins and their functional annotations, resulting in more robust and contextually enriched protein representations. Unlike previous methods, GOProteinGNN uniquely learns the entire protein knowledge graph during training, which allows it to capture broader relational nuances and dependencies beyond mere triplets as done in previous work. We perform a comprehensive evaluation on several downstream tasks demonstrating that GOProteinGNN consistently outperforms previous methods, showcasing its effectiveness and establishing it as a state-of-the-art solution for protein representation learning.

Paper Structure

This paper contains 41 sections, 3 equations, 2 figures, 7 tables, 1 algorithm.

Figures (2)

  • Figure 1: An example of a Protein Knowledge Graph. The upper figure illustrates a protein with associated biological knowledge, while the lower figure depicts the corresponding knowledge graph. The central node, Q14028, represents the protein connected to multiple GO terms through matching relations (edges), providing insights into Molecular Function, Cellular Component, and Biological Process.
  • Figure 2: The GOProteinGNN pre-training architecture. The model incorporates a protein's knowledge graph with relations and GO terms. Encoder layers process the amino acid sequence, producing amino acid representations and a protein representation (the representation of the [CLS] token). The GKI layer then refines the protein representation with knowledge graph information using graph learning techniques, yielding a knowledge-enhanced protein representation that captures essential biological contexts and interactions. Subsequent encoder layers further enhance the amino acid representations, while the model performs masked language modeling (MLM) to restore masked amino acids.
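
To make the two figures concrete, the following is a minimal sketch of a protein knowledge graph stored as (head, relation, tail) triples and of a single knowledge-injection step on the [CLS] vector. It is an illustration under stated assumptions, not the authors' implementation: the GO identifiers, relation names, embeddings, relation weights, and the weighted-mean aggregation with a residual connection are all hypothetical choices. Only the protein ID Q14028 and the overall flow (update [CLS] from graph neighbors, then pass it on to later encoder layers) come from the captions above.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph as (head, relation, tail) triples.
# Q14028 is the protein shown in Figure 1; the GO identifiers and
# relation names below are placeholders, not real annotations.
TRIPLES = [
    ("Q14028", "enables", "GO:A"),       # stands in for a Molecular Function edge
    ("Q14028", "located_in", "GO:B"),    # stands in for a Cellular Component edge
    ("Q14028", "involved_in", "GO:C"),   # stands in for a Biological Process edge
]

# Hypothetical 2-dimensional GO-term embeddings and scalar relation weights.
NODE_EMB = {"GO:A": [1.0, 0.0], "GO:B": [0.0, 1.0], "GO:C": [1.0, 1.0]}
REL_WEIGHT = {"enables": 1.0, "located_in": 0.5, "involved_in": 2.0}


def neighbors(triples):
    """Group outgoing (relation, tail) pairs by head node."""
    adj = defaultdict(list)
    for head, rel, tail in triples:
        adj[head].append((rel, tail))
    return adj


def inject_knowledge(cls_vec, protein_id, triples, node_emb, rel_weight):
    """One message-passing step on the protein node (assumed rule).

    Each neighboring GO term sends its embedding scaled by the weight of
    the connecting relation; the messages are averaged and added to the
    [CLS] vector as a residual update.
    """
    nbrs = neighbors(triples).get(protein_id, [])
    if not nbrs:
        return list(cls_vec)  # no graph knowledge available: leave [CLS] unchanged
    message = [0.0] * len(cls_vec)
    for rel, tail in nbrs:
        weight = rel_weight[rel]
        for i, value in enumerate(node_emb[tail]):
            message[i] += weight * value
    return [c + m / len(nbrs) for c, m in zip(cls_vec, message)]


def gki_step(token_vecs, protein_id):
    """Replace the [CLS] vector (index 0) with its knowledge-enhanced version.

    Amino acid vectors are copied unchanged here; in the architecture of
    Figure 2 the subsequent encoder layers would let them read the
    updated [CLS] representation.
    """
    enhanced_cls = inject_knowledge(
        token_vecs[0], protein_id, TRIPLES, NODE_EMB, REL_WEIGHT
    )
    return [enhanced_cls] + [list(v) for v in token_vecs[1:]]


if __name__ == "__main__":
    tokens = [[0.0, 0.0], [0.2, 0.1], [0.3, 0.4]]  # [CLS] followed by two residues
    print(gki_step(tokens, "Q14028"))
```

Routing the graph signal through [CLS] alone keeps the graph computation independent of sequence length: the protein is a single node in the knowledge graph, and the later encoder layers are what distribute the injected knowledge across the individual amino acids.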