Uncovering multi-order Popularity and Similarity Mechanisms in Link Prediction by graphlet predictors

Yong-Jian He, Yijun Ran, Zengru Di, Tao Zhou, Xiao-Ke Xu

TL;DR

This work introduces graphlet orbit degrees as a unified, multi-order representation of popularity and similarity mechanisms for link prediction. By representing traditional indices through node- and edge-orbit degrees and fusing them with XGBoost, the proposed OD framework achieves state-of-the-art performance across 550 real-world networks from six domains, while also enabling interpretability via SHAP analyses. The results reveal dominant roles for first-order similarity (notably M2 in social networks) and domain-specific patterns (e.g., M3 in economic/tech/info networks) with no single feature dominating biological or transportation networks. Overall, the approach provides both higher predictive accuracy and deeper mechanistic insights into how network structure drives link formation, with broad applicability to network analysis tasks beyond link prediction.

Abstract

Link prediction has become a critical problem in network science and has thus attracted increasing research interest. Popularity and similarity are two primary mechanisms in the formation of real networks. However, the roles of popularity and similarity mechanisms in link prediction across various domain networks remain poorly understood. Accordingly, this study used orbit degrees of graphlets to construct multi-order popularity- and similarity-based network link predictors, demonstrating that traditional popularity- and similarity-based indices can be efficiently represented in terms of orbit degrees. Moreover, we designed a supervised learning model that fuses multiple orbit-degree-based features and validated its link prediction performance. We also evaluated the mean absolute Shapley additive explanations of each feature within this model across 550 real-world networks from six domains. We observed that the homophily mechanism, which is a similarity-based feature, dominated social networks, with a win rate of 91%. Moreover, a different similarity-based feature was prominent in economic, technological, and information networks. Finally, no single feature dominated the biological and transportation networks. The proposed approach improves the accuracy and interpretability of link prediction, thus facilitating the analysis of complex networks.
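
The win rate mentioned in the abstract can be illustrated with a minimal sketch. It assumes, following the Figure 5 caption, that a feature "wins" a network when it has the highest mean absolute SHAP value in that network, and that the win rate is the fraction of networks it wins. The function name `win_rates` and the toy numbers are hypothetical and are not taken from the paper.

```python
from collections import Counter


def win_rates(mean_abs_shap_by_network):
    """Return, per feature, the fraction of networks in which it ranks first.

    Each element of the input is a dict mapping a feature name to its
    mean absolute SHAP value in one network (assumed input layout).
    """
    winners = Counter(
        # The winning feature of a network has the largest mean |SHAP|.
        max(scores, key=scores.get)
        for scores in mean_abs_shap_by_network
    )
    total = len(mean_abs_shap_by_network)
    return {feature: count / total for feature, count in winners.items()}


# Made-up values for three networks, for illustration only.
toy_networks = [
    {"M2": 0.40, "M3": 0.10, "N1": 0.05},
    {"M2": 0.30, "M3": 0.35, "N1": 0.02},
    {"M2": 0.50, "M3": 0.20, "N1": 0.10},
]
rates = win_rates(toy_networks)
```

In this toy input, M2 ranks first in two of the three networks and M3 in one. Features that never rank first are simply absent from the result.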

Paper Structure (6 sections, 25 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Graphlet orbits, orbit-degree-based link predictors, and orbit degree distribution. a Fifteen node orbits characterised by graphlets with 2–4 nodes. In each graphlet, nodes with different gray levels belong to different node orbits. b Fifteen multi-order popularity-based predictors, with the yellow node in each node orbit representing one endpoint of the target link for prediction (called target node in the following text). The order is defined by the length of the longest path starting from the target node. In these features, we define first-order popularity-based predictors as lower-order features, and second- and third-order predictors as higher-order, with the same applied to similarity predictors. c Twelve edge orbits characterised by graphlets comprising 3–4 nodes. In each graphlet, edges with different colors belong to different edge orbits. d Twelve multi-order similarity predictors, with the dotted red line in each edge orbit representing the target link for prediction. The order is defined by the length of the longest path starting from one endpoint without touching or passing through the other. e and f Distributions of normalized popularities $\widetilde{S^{Ni}_{xy}}=\frac{S^{Ni}_{\max}-S^{Ni}_{xy}}{S^{Ni}_{\max}-S^{Ni}_{\min}}$ and normalized similarities $\widetilde{S^{Mj}_{xy}}=\frac{S^{Mj}_{\max}-S^{Mj}_{xy}}{S^{Mj}_{\max}-S^{Mj}_{\min}}$ for existent and nonexistent links in a contact network of high school students, where $S^{Ni}_{\max}$ and $S^{Mj}_{\max}$ are maximum values, and $S^{Ni}_{\min}$ and $S^{Mj}_{\min}$ are minimum values over considered links for Eq. \ref{Equation_N} and Eq. \ref{Equation_M}, respectively. For most orbit degrees, the distributions for existent and nonexistent links are visually distinct.
  • Figure 2: Decomposing known popularity-based and similarity-based indices by orbit degrees. a The degrees of nodes $x$ and $y$ are captured by orbit N1. Thus, the PA index of these nodes is simply the product of their numbers of N1 orbits. b The CN index indicates the number of common neighbors for nodes $x$ and $y$, which can be directly mirrored by M2. c The CAR index depends on M2, which reflects the common neighbors, and M12, which represents the local community links. d The CN-L3 index indicates the combined effects of edge orbits M9, M11, and M12 around nodes $x$ and $y$, with each $M_9({x,y})$ or $M_{11}({x,y})$ contributing one to the score $S_{xy}^{CN-L3}$ and each $M_{12}({x,y})$ contributing two. e The MS index measures the similarity between nodes $x$ and $y$ by looking at their common neighbors and triangles they form, which can be fully represented by M2 and N4.
  • Figure 3: Results of feature analyses for the board membership network of Norwegian public limited companies. a Top 10 features for the XGBoost model, determined on the basis of mean absolute SHAP values. The blue and orange colors denote similarity and popularity features, respectively. The two highest-ranked features, namely M2 and M4, are first-order similarity features, which preliminarily indicates the dominant role of first-order similarity mechanisms in link formation of the examined network. b SHAP values of the top 10 features for all positive and negative samples, with the color of data points reflecting the feature values of corresponding samples.
  • Figure 4: Contribution share analysis of multi-order popularity and similarity feature categories. a The contribution share of five feature categories for the board membership network of Norwegian public limited companies. b The winning rates of the five feature categories in the six domains, which are determined based on their contribution shares in each network.
  • Figure 5: Different feature roles in networks from different domains. a PCA scatter plot of networks. In this scatter plot, each network is represented by a 27-dimensional vector composed of the SHAP values corresponding to the 27 orbit degrees. After these vectors are reduced to two-dimensional ones, data points (corresponding to networks) are visualised with different colors representing different domains. b Variations in the winning rates of all features for the six domains, subject to the highest mean absolute SHAP values in each network. c The winning rates across all features in biological networks with different sub-domains.
  • ...and 7 more figures
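
The decompositions described in the Figure 2 caption can be sketched on a toy graph. This is a minimal illustration, not the authors' implementation: it takes N1 to be the node degree and M2 to be the number of common neighbors, as the caption states, and it treats the M9, M11, and M12 counts as given inputs because the caption does not say how those orbits are enumerated. All function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def n1(adj, x):
    """Node-orbit degree N1 of x, taken here to be its degree."""
    return len(adj[x])


def m2(adj, x, y):
    """Edge-orbit degree M2 of (x, y), taken here as the common-neighbor count."""
    return len(adj[x] & adj[y])


def pa_index(adj, x, y):
    """PA index as the product of the two endpoints' N1 orbit degrees."""
    return n1(adj, x) * n1(adj, y)


def cn_index(adj, x, y):
    """CN index, mirrored directly by M2."""
    return m2(adj, x, y)


def cn_l3_index(m9, m11, m12):
    """CN-L3 score from precomputed counts: M9 and M11 add one each, M12 adds two."""
    return m9 + m11 + 2 * m12


# Toy graph (made up): a and d are not linked and share neighbors b and c.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")]
adj = build_adjacency(toy_edges)

pa_score = pa_index(adj, "a", "d")
cn_score = cn_index(adj, "a", "d")
cn_l3_score = cn_l3_index(m9=3, m11=1, m12=2)  # arbitrary example counts
```

For the candidate link (a, d), both endpoints have degree 2 and share two neighbors, so the PA and CN scores follow directly from the N1 and M2 counts. The CAR and MS indices are omitted because the caption names their ingredients (M2 with M12, and M2 with N4) but not the formulas that combine them.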