Synthetic graphs for link prediction benchmarking

Alexey Vlaskin, Eduardo G. Altmann

TL;DR

The paper tackles the challenge of benchmarking link-prediction algorithms by designing synthetic graphs that jointly embed micro-scale motifs and meso-scale communities, enabling a closed-form calculation of the maximum achievable prediction quality via an ideal predictor. It evaluates four representative methods—Adamic–Adar, SBM, Node2Vec, and GraphSage—across varied structural regimes, revealing that method success is strongly tied to the underlying graph structure and that no single approach excels in all cases. A key contribution is the analytical upper bound on $AUC$ (predictability) derived from the synthetic graph parameters, providing a principled benchmark to assess how close real methods come to the theoretical limit. The findings underscore the importance of structure-aware benchmarking in link prediction and suggest that combining methods and expanding synthetic benchmarks can drive the development of more robust predictive techniques, with open-source generation code to foster further research.
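
The caption of Figure 2 (listed under Figures) describes the ideal predictor's ROC curve as two straight segments through $A=(0,\widetilde{e_{SB}})$, $B=(\overline{e_{SS}},1)$, and $C=(1,1)$. Under that description, the area under the curve follows from elementary geometry: a trapezoid below $AB$ plus a rectangle below $BC$, which simplifies to $1 - \overline{e_{SS}}\,(1-\widetilde{e_{SB}})/2$. The sketch below only illustrates this arithmetic; the function name and the arguments `e_sb` and `e_ss` are hypothetical placeholders for $\widetilde{e_{SB}}$ and $\overline{e_{SS}}$, whose formulas in terms of the model parameters are given in the paper and are not reproduced here.

```python
def ideal_auc(e_sb: float, e_ss: float) -> float:
    """Area under the piecewise-linear ROC curve A -> B -> C.

    A = (0, e_sb), B = (e_ss, 1), C = (1, 1), with FPR on the x-axis
    and TPR on the y-axis. Both inputs are assumed to lie in [0, 1];
    they are placeholders, not values computed from model parameters.
    """
    if not (0.0 <= e_sb <= 1.0 and 0.0 <= e_ss <= 1.0):
        raise ValueError("e_sb and e_ss must both lie in [0, 1]")
    # Trapezoid under segment AB: width e_ss, parallel sides e_sb and 1.
    trapezoid = e_ss * (e_sb + 1.0) / 2.0
    # Rectangle under segment BC: width (1 - e_ss), height 1.
    rectangle = 1.0 - e_ss
    return trapezoid + rectangle
```

For instance, `ideal_auc(0.5, 0.5)` gives 0.875, and `ideal_auc(0.0, 1.0)` gives 0.5, the value of a random guess.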

Abstract

Predicting missing links in complex networks requires algorithms that are able to explore statistical regularities in the existing data. Here we investigate the interplay between algorithm efficiency and network structures through the introduction of suitably-designed synthetic graphs. We propose a family of random graphs that incorporates both micro-scale motifs and meso-scale communities, two ubiquitous structures in complex networks. A key contribution is the derivation of theoretical upper bounds for link prediction performance in our synthetic graphs, allowing us to estimate the predictability of the task and obtain an improved assessment of the performance of any method. Our results on the performance of classical methods (e.g., Stochastic Block Models, Node2Vec, GraphSage) show that the performance of all methods correlates with the theoretical predictability, that no single method is universally superior, and that each of the methods exploits different characteristics known to exist in large classes of networks. Our findings underline the need for careful consideration of graph structure when selecting a link prediction method and emphasize the value of comparing performance against synthetic benchmarks. We provide open-source code for generating these synthetic graphs, enabling further research on link prediction methods.

Paper Structure

This paper contains 18 sections, 9 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: An example of the synthetic graphs we use as benchmarks for link-prediction methods. In this example, there are $N_{B}=4$ bridge nodes and $M=4$ structures. Each structure is a 2d lattice with $k=4$, so the number of structure nodes is $N_{S} = 4 \times 4 \times 4 = 64$. Bridge nodes connect randomly, and only to structure nodes, with probability $D_{B} / N_{S}$; $D_{B}=4$ is used in this example.
  • Figure 2: Maximum prediction score in the synthetic graphs. The straight lines $\vec{AB}$ and $\vec{BC}$ compose the prediction curve $(FPR(t),TPR(t))$ for $t\in[0,1]$ of the ideal prediction method in our synthetic graph. The point $A$ is located at $(0,\widetilde{e_{SB}})$, the point $B$ at $(\overline{e_{SS}},1)$, and the point $C$ at $(1,1)$. See Tab. \ref{tab.parameters} for the formulas expressing $\widetilde{e_{SB}}$ and $\overline{e_{SS}}$ as a function of model parameters. See \ref{app.analyticalideal} for a derivation of these points.
  • Figure 3: Performance of link prediction methods (see legend) in networks with an increasing number of structures $M$ and bridge nodes $N_{B}$. The networks have average bridge degree $D_{B} = 5$, the structures are 2d lattices with $k=8$ and closed diagonals, and $N_{B} = M$ is varied (x-axis).
  • Figure 4: Performance of link prediction methods for graphs with increasing fraction $C_{S}$ of structure nodes. Graphs with $N = 3200$ nodes were built using bridges and two types of structures: cliques (left panel) and 2d lattices (right panel). Here $k = 8$, so each clique has 8 nodes and each lattice has $8 \times 8 = 64$ nodes. The bridge probability is $\alpha = 12 / ( N \cdot C_{S} )$. See \ref{app.tables} (\ref{faaa-table-clique} and \ref{faab-table-lattice}) for the exact generation parameters.
  • Figure 5: Performance of link prediction methods for structures of growing size $k$. The number of structures $M = 4$, the ratio of structure nodes $C_{S} = 0.75$, and the average bridge degree $D_{B}$ are kept fixed. Left: square 2d lattices whose diagonals are not closed. Right: square 2d lattices with one diagonal closed, forming triangles, as illustrated in Fig. \ref{fig.latticegraphwithdiags}. In both cases, the number of structure nodes grows with $k$ as $N_{S} = 4k^2$ and the number of bridge nodes is $N_{B} = (1 - 0.75)\cdot N$, to keep $C_{S}=0.75$ constant.
  • ...and 4 more figures
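
The construction described in the caption of Figure 1 can be sketched in a few lines of plain Python. This is an illustrative reading of the caption, not the authors' released generator: the function names, the integer node labelling, and the choice of open (non-periodic) lattice boundaries are assumptions made here. Each of the $M$ structures is a $k \times k$ lattice, optionally with one diagonal closed as in Figures 3 and 5, and each bridge node links to each structure node independently with probability $D_{B}/N_{S}$.

```python
import random
from itertools import product


def lattice_edges(k, offset=0, close_diagonal=False):
    """Edges of a k x k square lattice with open boundaries (an assumption).

    Nodes are labelled offset + row * k + col; edges are (u, v) with u < v.
    """
    edges = set()
    for row, col in product(range(k), repeat=2):
        u = offset + row * k + col
        if col + 1 < k:
            edges.add((u, u + 1))          # right neighbour
        if row + 1 < k:
            edges.add((u, u + k))          # lower neighbour
        if close_diagonal and row + 1 < k and col + 1 < k:
            edges.add((u, u + k + 1))      # one diagonal, closing triangles
    return edges


def synthetic_graph(M, k, n_bridge, d_bridge, close_diagonal=False, seed=None):
    """Return (N_S, edges) for M lattices plus n_bridge bridge nodes.

    Structure nodes are 0 .. N_S - 1; bridge nodes are N_S .. N_S + n_bridge - 1.
    Bridge nodes never connect to each other, only to structure nodes.
    """
    rng = random.Random(seed)
    n_structure = M * k * k
    if not 0 <= d_bridge <= n_structure:
        raise ValueError("d_bridge / N_S must be a valid probability")
    edges = set()
    for m in range(M):
        edges |= lattice_edges(k, offset=m * k * k, close_diagonal=close_diagonal)
    p = d_bridge / n_structure             # connection probability D_B / N_S
    for b in range(n_structure, n_structure + n_bridge):
        for s in range(n_structure):
            if rng.random() < p:
                edges.add((s, b))
    return n_structure, edges
```

Calling `synthetic_graph(M=4, k=4, n_bridge=4, d_bridge=4, seed=0)` reproduces the sizes of the Figure 1 example: 64 structure nodes and, with open boundaries, 96 lattice edges, plus a random set of bridge edges whose expected number per bridge node is $D_{B}=4$.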