The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models

Alberto Cattaneo, Stephen Bonner, Thomas Martynec, Edward Morrissey, Carlo Luschi, Ian P Barrett, Daniel Justus

TL;DR

This paper investigates how graph topology affects biomedical knowledge graph completion through a triple-level analysis of five KGE models across six public biomedical KGs. It introduces a topology-focused framework and a toolkit for describing and analyzing per-edge properties, and finds that tail in-degree correlates positively, and head out-degree negatively, with predictive accuracy, while composition patterns help mainly in low-degree cases. The study also shows that model performance can vary substantially across relation types and in cross-dataset scenarios, and that adding large amounts of training data can harm shallow models, which highlights the need for principled KG construction and validation. Overall, the work offers practical guidance for biomedical KG construction and evaluation, together with tools and data that enable further topology-driven analyses in the community.

Abstract

Knowledge Graph Completion has been increasingly adopted as a useful method for helping address several tasks in biomedical research, such as drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset, and associated modelling choices, useful for a given task. Moreover, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. In this work, we conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world tasks. By releasing all model predictions and a new suite of analysis tools we invite the community to build upon our work and continue improving the understanding of these crucial applications.

Paper Structure

This paper contains 15 sections, 29 figures, and 4 tables.

Figures (29)

  • Figure 1: The four primary edge topological patterns we consider. The base triple $(h, r, t)$ is common to all patterns, while the dashed lines, with relations $r^\prime$, $r_1$ or $r_2$, are the edges that realize the defining feature of each pattern. We use $r^\prime$, $r_1$ and $r_2$ to denote relations distinct from $r$.
  • Figure 2: Example of edge cardinalities in a KG with two relation types (blue, red). We use the notation $\textrm{deg}_r(t)$:$\textrm{deg}_r(h)$. For example, for the red edge with cardinality 1:M, the head has more than one outgoing edge of the same relation type (red) and the tail has a single red incoming edge.
  • Figure 3: Counts of composition patterns for key relation types within Hetionet and PrimeKG. Values above the bars indicate the number of distinct pairs of relation types in the composition.
  • Figure 4: Occurrence of edge cardinalities in the datasets.
  • Figure 5: Relative frequency of triples when grouped by head out-degree and tail in-degree of the same relation type.
  • ...and 24 more figures
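
The cardinality notation from the Figure 2 caption, and the same-relation degrees used for grouping in Figure 5, can be illustrated with a minimal sketch. It assumes the KG is given as a list of distinct `(h, r, t)` triples. The function name `edge_cardinalities` and the toy entities are hypothetical and are not taken from the paper's toolkit.

```python
from collections import Counter


def edge_cardinalities(triples):
    """Label each (h, r, t) triple as deg_r(t):deg_r(h), using '1' or 'M'.

    Hypothetical helper for illustration only; not the paper's toolkit API.
    """
    triples = list(dict.fromkeys(triples))  # drop exact duplicates, keep order
    # Head out-degree of the same relation type: number of triples (h, r, *).
    out_deg = Counter((h, r) for h, r, _ in triples)
    # Tail in-degree of the same relation type: number of triples (*, r, t).
    in_deg = Counter((r, t) for _, r, t in triples)

    labels = {}
    for h, r, t in triples:
        tail_side = "1" if in_deg[(r, t)] == 1 else "M"
        head_side = "1" if out_deg[(h, r)] == 1 else "M"
        labels[(h, r, t)] = f"{tail_side}:{head_side}"
    return labels


# Toy graph with two relation types, loosely mirroring the Figure 2 setup.
toy_triples = [
    ("a", "red", "b"),   # a has two red outgoing edges, b one red incoming
    ("a", "red", "c"),
    ("d", "blue", "e"),  # single blue edge out of d and into e
    ("f", "blue", "g"),  # g has two blue incoming edges
    ("h", "blue", "g"),
    ("x", "red", "z"),   # x has two red outgoing edges, z two red incoming
    ("y", "red", "z"),
    ("x", "red", "w"),
]

cardinalities = edge_cardinalities(toy_triples)
```

Here `cardinalities[("a", "red", "b")]` is `"1:M"`, which matches the caption's example: the head has more than one outgoing red edge and the tail has a single incoming red edge. Counting per `(entity, relation)` pair rather than per entity keeps the degrees specific to the relation type, as the captions require.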