Unlocking Advanced Graph Machine Learning Insights through Knowledge Completion on Neo4j Graph Database
Rosario Napoli, Antonio Celesti, Massimo Villari, Maria Fazio
TL;DR
This work identifies Knowledge Completion (KC) as a critical missing phase in Graph Database–Graph Machine Learning (GDB-GML) pipelines, which leads to incomplete graphs and biased downstream models. It proposes an architecture that inserts KC between Knowledge Fusion (KF) and Knowledge Reasoning (KR), introducing scalable transitive relationships with a decay-based propagation to deterministically reveal hidden knowledge and reshape topology prior to ML. The authors formalize transitive propagation with a transitivity indicator $T(e_{ij}, r) \in \{0,1\}$, path strengths $S_p(x,z,r) = f(h(p))$, and aggregated strength $S(x,z,r) = A(\{S_p(x,z,r)\mid p \in P(x,z,r)\})$, propagating when $S(x,z,r) > \tau$. Experimental results on Roman Empire and Royal Family genealogies show substantial changes in centrality measures and node influence after KC, demonstrating improved data representations and potential for enhanced GML performance; they also discuss practical scalability considerations and future integration with GNNs for robust downstream tasks.
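The propagation rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes $f(h) = \delta^{h-1}$ (exponential decay in the hop count $h$), takes the aggregator $A$ to be the maximum over paths, and represents $T$ as a set of relationship types marked transitive. The names `path_hops`, `complete_knowledge`, `decay`, `tau`, and `max_hops`, and the `ANCESTOR_OF` relationship in the usage note, are hypothetical.

```python
from collections import defaultdict


def path_hops(adjacency, source, max_hops):
    """Yield (target, hops) for each simple path of at most max_hops edges."""
    stack = [(source, (source,))]
    while stack:
        node, path = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt in path:  # keep paths simple (no revisited nodes)
                continue
            new_path = path + (nxt,)
            hops = len(new_path) - 1
            yield nxt, hops
            if hops < max_hops:
                stack.append((nxt, new_path))


def complete_knowledge(edges, transitive, decay=0.5, tau=0.3, max_hops=4):
    """Return {(x, r, z): S(x, z, r)} for inferred edges with S > tau.

    edges: iterable of (source, relation, target) triples.
    transitive: relation types r for which T(e, r) = 1.
    """
    existing = set(edges)
    inferred = {}
    for relation in transitive:
        adjacency = defaultdict(list)
        for x, r, z in existing:
            if r == relation:
                adjacency[x].append(z)
        for x in list(adjacency):
            per_target = defaultdict(list)
            for z, hops in path_hops(adjacency, x, max_hops):
                # S_p = f(h(p)); the exponential form is an assumption.
                per_target[z].append(decay ** (hops - 1))
            for z, strengths in per_target.items():
                s = max(strengths)  # A = max over paths, also an assumption
                if s > tau and (x, relation, z) not in existing:
                    inferred[(x, relation, z)] = s
    return inferred
```

On a toy chain A → B → C → D of `ANCESTOR_OF` edges with `decay=0.5` and `tau=0.3`, the two-hop links A → C and B → D receive strength 0.5 and are added, while the three-hop link A → D has strength 0.25 and falls below the threshold.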
Abstract
Graph Machine Learning (GML) with Graph Databases (GDBs) has gained significant relevance in recent years, owing to its ability to handle complex interconnected data and apply ML techniques using Graph Data Science (GDS). However, a critical gap exists in the way current GDB-GML applications analyze data, especially with respect to Knowledge Completion (KC) in Knowledge Graphs (KGs). In particular, current architectures ignore KC and work on datasets that appear incomplete or fragmented, even though they actually contain valuable hidden knowledge. This limitation may lead to wrong interpretations when these data are used as input for GML models. This paper proposes an innovative architecture that integrates a KC phase into GDB-GML applications and demonstrates how revealing hidden knowledge can heavily impact a dataset's behavior and metrics. For this purpose, we introduce scalable transitive relationships: links that propagate information over the network and are modelled by a decay function, allowing deterministic knowledge flows across multiple nodes. Experimental results demonstrate that our approach radically reshapes both the topology and the overall dynamics of the dataset, underscoring the need for this new GDB-GML architecture to produce better models and unlock the full potential of graph-based data analysis.
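To make the effect on metrics concrete, the following toy sketch compares normalized degree centrality before and after a handful of completed links are added. It is only an illustration: the four-node chain and the two inferred edges are made up, not taken from the paper's datasets, and the helper `degree_centrality` is a hypothetical pure-Python stand-in for a graph library's centrality routine.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality on the undirected view of (source, target) pairs."""
    neighbours = defaultdict(set)
    for x, z in edges:
        neighbours[x].add(z)
        neighbours[z].add(x)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Toy data (assumed): a chain plus two links that a KC phase might reveal.
original = [("A", "B"), ("B", "C"), ("C", "D")]
inferred = [("A", "C"), ("B", "D")]

before = degree_centrality(original)
after = degree_centrality(original + inferred)
for node in sorted(before):
    print(f"{node}: {before[node]:.2f} -> {after[node]:.2f}")
```

Even in this tiny graph the end nodes double their score and the middle nodes become connected to every other node, which is the kind of shift a downstream model would see in its input features.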
