A Systematic Evaluation of Knowledge Graph Embeddings for Gene-Disease Association Prediction

Catarina Canastra, Cátia Pesquita

TL;DR

This paper systematically evaluates knowledge graph embeddings for gene–disease association prediction, contrasting link prediction with node-pair classification and assessing how semantic richness from disease ontologies influences performance. Using KG configurations built from GO, HP, DO, and cross-ontology LDs/MAPs, it applies both shallow KG embeddings and walk-based RDF2Vec, paired with end-to-end scoring or supervised classifiers. The study finds that link prediction generally outperforms node-pair classification, especially when ontologies and inter-ontology links are included, though node-pair classification reliably ranks true positives in test sets. These insights support a framework for selecting task types and KG configurations to optimize gene–disease discovery and have practical implications for disease mechanism understanding and drug repurposing.

Abstract

Discovering gene-disease links is important in biology and medicine, enabling disease identification and drug repurposing. Machine learning approaches accelerate this process by leveraging biological knowledge represented in ontologies and the structure of knowledge graphs. Still, many existing works overlook ontologies that explicitly represent diseases, missing causal and semantic relationships between them. The gene-disease association problem is naturally framed as a link prediction task, where embedding algorithms directly predict associations by exploring the structure and properties of the knowledge graph. Other works frame it as a node-pair classification task, combining embedding algorithms with traditional machine learning algorithms. This strategy aligns with the logic of a machine learning pipeline. However, the use of negative examples and the lack of validated gene-disease associations to train embedding models may constrain its effectiveness. This work introduces a novel framework for comparing the performance of link prediction versus node-pair classification tasks, analyses the performance of state-of-the-art gene-disease association approaches, and compares the different order-based formalizations of gene-disease association prediction. It also evaluates the impact of semantic richness, added through a disease-specific ontology and through additional links between ontologies. The framework involves five steps: data splitting, knowledge graph integration, embedding, modeling and prediction, and method evaluation. Results show that enriching the semantic representation of diseases slightly improves performance, while additional links between ontologies have a greater impact. Link prediction methods better explore the semantic richness encoded in knowledge graphs. Although node-pair classification methods identify all true positives, link prediction methods perform better overall.
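
To make the two task formulations in the abstract concrete, the following is a minimal sketch in plain Python. It is illustrative only and not the authors' implementation: the toy embeddings are random, the TransE-style distance score, the Hadamard product used to combine embeddings, and the small logistic regression are all assumed stand-ins, and every name (`gene_emb`, `assoc_rel`, `rank_diseases`, and so on) is hypothetical.

```python
import math
import random

random.seed(0)
DIM = 8


def random_vector(dim=DIM):
    """Stand-in for a learned embedding (hypothetical toy data)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


# Assumed toy embeddings; a real pipeline would learn these from the knowledge graph.
gene_emb = {f"gene{i}": random_vector() for i in range(4)}
disease_emb = {f"disease{i}": random_vector() for i in range(3)}
assoc_rel = random_vector()  # assumed embedding of a gene-disease relation


def transe_score(head, rel, tail):
    """TransE-style score (an assumed choice): negative L2 distance of head + rel from tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, rel, tail)))


def rank_diseases(gene_id):
    """Link prediction: score every candidate disease and sort from most to least plausible."""
    scores = {d: transe_score(gene_emb[gene_id], assoc_rel, vec)
              for d, vec in disease_emb.items()}
    return sorted(scores, key=scores.get, reverse=True)


def pair_features(gene_id, disease_id):
    """Node-pair classification: combine two embeddings (Hadamard product, an assumed operator)."""
    return [g * d for g, d in zip(gene_emb[gene_id], disease_emb[disease_id])]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp in a safe range
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, epochs=200, lr=0.1):
    """Tiny logistic regression trained by stochastic gradient descent."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict_proba(model, features):
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, features)) + bias)


# Hypothetical labeled pairs: 1 = known association, 0 = sampled negative.
labeled = [("gene0", "disease0", 1), ("gene1", "disease1", 1),
           ("gene2", "disease0", 0), ("gene3", "disease2", 0)]
model = train_logistic([pair_features(g, d) for g, d, _ in labeled],
                       [y for _, _, y in labeled])

ranking = rank_diseases("gene0")  # link prediction output: an ordered candidate list
probability = predict_proba(model, pair_features("gene0", "disease1"))  # classifier output
```

The contrast to notice is in the outputs: link prediction returns a ranking produced by the embedding model's own scoring function, while node-pair classification needs explicit negative examples and a separate classifier trained on combined embeddings.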

Paper Structure

This paper contains 27 sections, 1 equation, 2 figures, 18 tables.

Figures (2)

  • Figure 1: Diagram of the comparison framework for link prediction and node-pair classification tasks. The framework consists of shared steps (data splitting, knowledge graph construction, and evaluation) and task-specific steps: link prediction integrates positive training pairs into the knowledge graphs and applies the scoring function of knowledge graph embedding methods, whereas node-pair classification combines gene and disease embeddings, trains supervised learning algorithms, and tests classifiers. Rectangles are color-coded: gray for link prediction-specific steps and yellow for node-pair classification-specific steps.
  • Figure 2: Visual depiction illustrating interconnected relationships among genes, diseases, and ontologies, providing a deeper understanding of complex biological associations within a structured knowledge graph.
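
The data-preparation side of the framework in Figure 1 can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the identifiers in the toy data are placeholders, the relation name `associated_with` is hypothetical, and drawing negatives at random from unobserved gene-disease pairs is an assumed strategy that the source does not specify.

```python
import random


def split_pairs(pairs, test_fraction=0.3, seed=42):
    """Shared step: split known gene-disease pairs into train and test sets."""
    shuffled = sorted(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def kg_for_link_prediction(ontology_triples, train_positives, relation="associated_with"):
    """Link prediction: add positive training pairs to the graph as triples."""
    return list(ontology_triples) + [(g, relation, d) for g, d in train_positives]


def sample_negatives(all_positives, n, seed=42):
    """Node-pair classification: pick unobserved pairs as negatives (assumed strategy)."""
    known = set(all_positives)
    genes = sorted({g for g, _ in all_positives})
    diseases = sorted({d for _, d in all_positives})
    candidates = [(g, d) for g in genes for d in diseases if (g, d) not in known]
    random.Random(seed).shuffle(candidates)
    return candidates[:n]


def labeled_training_set(train_positives, all_positives, seed=42):
    """Node-pair classification: pair each example with a 1/0 label."""
    negatives = sample_negatives(all_positives, len(train_positives), seed)
    return [(g, d, 1) for g, d in train_positives] + [(g, d, 0) for g, d in negatives]


# Placeholder data; the identifiers below are invented for illustration.
positives = [("G1", "D1"), ("G2", "D1"), ("G3", "D2"), ("G4", "D3"), ("G5", "D2")]
ontology_triples = [("G1", "annotated_with", "term_a"), ("D1", "subclass_of", "term_b")]

train, test = split_pairs(positives, test_fraction=0.4)
lp_graph = kg_for_link_prediction(ontology_triples, train)
clf_examples = labeled_training_set(train, positives)
```

In this sketch the test pairs never enter `lp_graph`, so the embedding model cannot see the associations it is later asked to predict, and negatives are drawn against all known positives so that a held-out true association is not mislabeled as a negative.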