Synthetic graphs for link prediction benchmarking
Alexey Vlaskin, Eduardo G. Altmann
TL;DR
The paper addresses the benchmarking of link-prediction algorithms by designing synthetic graphs that combine micro-scale motifs with meso-scale communities, which allows the maximum achievable prediction quality to be computed in closed form from an ideal predictor. It evaluates four representative methods (Adamic–Adar, SBM, Node2Vec, and GraphSage) across varied structural regimes and finds that each method's success depends strongly on the underlying graph structure, with no single approach excelling in all cases. A key contribution is the analytical upper bound on $AUC$ (the predictability), derived from the synthetic graph parameters, which provides a principled benchmark for how close real methods come to the theoretical limit. The findings underscore the importance of structure-aware benchmarking in link prediction and suggest that combining methods and expanding synthetic benchmarks can help develop more robust predictive techniques. Open-source generation code is provided to support further research.
Abstract
Predicting missing links in complex networks requires algorithms that are able to explore statistical regularities in the existing data. Here we investigate the interplay between algorithm efficiency and network structures through the introduction of suitably designed synthetic graphs. We propose a family of random graphs that incorporates both micro-scale motifs and meso-scale communities, two ubiquitous structures in complex networks. A key contribution is the derivation of theoretical upper bounds for link prediction performance in our synthetic graphs, allowing us to estimate the predictability of the task and obtain an improved assessment of the performance of any method. Our results on the performance of classical methods (e.g., Stochastic Block Models, Node2Vec, GraphSage) show that the performance of all methods correlates with the theoretical predictability, that no single method is universally superior, and that each of the methods exploits different characteristics known to exist in large classes of networks. Our findings underline the need for careful consideration of graph structure when selecting a link prediction method and emphasize the value of comparing performance against synthetic benchmarks. We provide open-source code for generating these synthetic graphs, enabling further research on link prediction methods.
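
The idea of a predictability bound can be illustrated on a much simpler model than the graph family proposed here. The following is a minimal sketch, assuming a two-community planted-partition graph with link probabilities `p_in` (within a community) and `p_out` (between communities), where `p_in >= p_out`. It is not the paper's derivation, and the function names and parameters are hypothetical. An ideal predictor that scores every node pair by its true link probability reaches an expected AUC of $0.5 + 0.5\,(a - b)$, where $a$ is the fraction of within-community pairs among links and $b$ is the same fraction among non-links; pairs with equal scores count as ties with weight 1/2.

```python
import random
from itertools import combinations


def ideal_auc_closed_form(n, p_in, p_out):
    """Expected AUC of a predictor that scores pairs by their true link probability.

    Assumption (not from the paper): two equal-sized communities, n even,
    and p_in >= p_out so that within-community pairs get the higher score.
    """
    half = n // 2
    within = half * (half - 1)   # node pairs inside either community
    between = half * half        # node pairs across the two communities
    # Fraction of within-community pairs among links (a) and non-links (b).
    a = within * p_in / (within * p_in + between * p_out)
    b = within * (1 - p_in) / (within * (1 - p_in) + between * (1 - p_out))
    # Wins: link is within, non-link is between. Ties count one half.
    return 0.5 + 0.5 * (a - b)


def ideal_auc_empirical(n, p_in, p_out, seed=0):
    """Sample one graph and measure the AUC of the same ideal predictor."""
    rng = random.Random(seed)
    half = n // 2
    counts = {(s, l): 0 for s in (True, False) for l in (True, False)}
    for u, v in combinations(range(n), 2):
        same = (u < half) == (v < half)
        linked = rng.random() < (p_in if same else p_out)
        counts[(same, linked)] += 1
    pos_in, pos_out = counts[(True, True)], counts[(False, True)]
    neg_in, neg_out = counts[(True, False)], counts[(False, False)]
    wins = pos_in * neg_out                       # link scored above non-link
    ties = pos_in * neg_in + pos_out * neg_out    # identical scores
    total = (pos_in + pos_out) * (neg_in + neg_out)
    return (wins + 0.5 * ties) / total


if __name__ == "__main__":
    bound = ideal_auc_closed_form(400, 0.2, 0.05)
    measured = ideal_auc_empirical(400, 0.2, 0.05)
    print(f"closed form: {bound:.3f}, sampled graph: {measured:.3f}")
```

In this toy setting no method can be expected to beat the closed-form value, because the ideal predictor already uses all the information the generative model contains. The bound is well below 1 even for clearly separated communities, which is why comparing a method's AUC to the bound can be more informative than comparing it to 1.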
