Improving Graph Embeddings in Machine Learning Using Knowledge Completion with Validation in a Case Study on COVID-19 Spread
Rosario Napoli, Gabriele Morabito, Antonio Celesti, Massimo Villari, Maria Fazio
TL;DR
The paper tackles the limitation that standard graph embeddings miss latent, implicit knowledge in sparse datasets. It introduces a dedicated Knowledge Completion (KC) phase that models scalable transitive relationships with decay-based inference to complete the graph prior to embedding. The KC-enhanced GML pipeline materially alters embedding space geometry and centrality dynamics, as demonstrated on a temporal COVID-19 contact network, improving the expressiveness of both Node2Vec and GraphSAGE embeddings. The findings suggest that KC is a transformative pre-processing step, with practical implications for more accurate propagation and centrality analysis in knowledge-rich graphs, and they motivate further study of downstream task performance and scalability.
Abstract
The rise of graph-structured data has driven major advances in Graph Machine Learning (GML), where graph embeddings (GEs) map features from Knowledge Graphs (KGs) into vector spaces, enabling tasks such as node classification and link prediction. However, because GEs are derived from explicit topology and features, they may miss crucial implicit knowledge hidden in seemingly sparse datasets, which affects both the graph structure and its representation. We propose a GML pipeline that integrates a Knowledge Completion (KC) phase to uncover latent dataset semantics before embedding generation. Focusing on transitive relations, we model hidden connections with decay-based inference functions that reshape the graph topology, with consequences for embedding dynamics and aggregation processes in GraphSAGE and Node2Vec. Experiments show that our GML pipeline significantly alters the embedding space geometry, demonstrating that its introduction is not just a simple enrichment but a transformative step that redefines graph representation quality.
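To make the idea of decay-based transitive inference concrete, the following is a minimal sketch, not the authors' implementation. It assumes a directed graph given as a dictionary of weighted edges and assumes one simple decay rule: an inferred edge over a k-hop path receives the product of the path's edge weights multiplied by decay^(k-1). The function name `complete_transitive` and the parameters `decay`, `max_hops`, and `min_weight` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def complete_transitive(edges, decay=0.5, max_hops=3, min_weight=0.05):
    """Add inferred edges for multi-hop paths, weakening them per extra hop.

    edges: dict mapping (src, dst) -> weight in (0, 1], treated as directed.
    Returns a new dict with the original edges plus the inferred ones.
    The decay rule used here is an illustrative assumption.
    """
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w

    completed = dict(edges)
    for source in list(adj):
        # frontier: node -> strongest path weight reaching it in the current hop count
        frontier = dict(adj[source])
        for _hop in range(2, max_hops + 1):
            next_frontier = {}
            for mid, w_path in frontier.items():
                for dst, w_edge in adj.get(mid, {}).items():
                    if dst == source:
                        continue  # skip paths that return to the source
                    w = w_path * w_edge * decay
                    if w < min_weight:
                        continue  # prune inferences that are too weak
                    if w > next_frontier.get(dst, 0.0):
                        next_frontier[dst] = w
            for dst, w in next_frontier.items():
                # explicit edges stay untouched; keep the strongest inferred weight
                if (source, dst) not in edges and w > completed.get((source, dst), 0.0):
                    completed[(source, dst)] = w
            frontier = next_frontier
    return completed


# Toy contact chain: a met b, b met c, c met d.
contacts = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0}
completed = complete_transitive(contacts)
```

On this toy chain the completed graph gains the edges a→c and b→d with weight 0.5 and a→d with weight 0.25, while the three explicit edges keep their original weights. In a pipeline like the one described, the completed edge set would then be passed to the embedding step (Node2Vec or GraphSAGE) in place of the original sparse graph.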
