The maximum capability of a topological feature in link prediction

Yijun Ran; Xiao-Ke Xu; Tao Jia

TL;DR

The paper proves a universal upper bound on the predictive capability of any topological feature for link prediction. The bound depends only on the fractions of missing and nonexistent links that carry the feature, denoted $p_1$ and $p_2$ respectively. All indexes within a feature family share this upper bound, while supervised learning can lift the bound by $\Delta=(1-p_1)p_2$, yielding $\text{AUC}'_{upper}$ (and a corresponding precision analogue). The authors validate the theory across 550 networks and provide explicit expressions for $p_1$ and $p_2$ for the common neighbor feature, linking them to motif counts such as closed and open triangles. The framework offers practical guidance for feature and method selection and reveals that network structure beyond clustering (e.g., open-triangle motifs) crucially influences feature effectiveness. Overall, the results deliver a principled, quantitative tool for assessing and optimizing topological features in link prediction.

Abstract

Networks offer a powerful approach to modeling complex systems by representing the underlying set of pairwise interactions. Link prediction is the task that predicts links of a network that are not directly visible, with profound applications in biological, social, and other complex systems. Despite intensive utilization of the topological feature in this task, it is unclear to what extent a feature can be leveraged to infer missing links. Here, we aim to unveil the capability of a topological feature in link prediction by identifying its prediction performance upper bound. We introduce a theoretical framework that is compatible with different indexes to gauge the feature, different prediction approaches to utilize the feature, and different metrics to quantify the prediction performance. The maximum capability of a topological feature follows a simple yet theoretically validated expression, which only depends on the extent to which the feature is held in missing and nonexistent links. Because a family of indexes based on the same feature shares the same upper bound, the potential of all others can be estimated from one single index. Furthermore, a feature's capability is lifted in the supervised prediction, which can be mathematically quantified, allowing us to estimate the benefit of applying machine learning algorithms. The universality of the pattern uncovered is empirically verified by 550 structurally diverse networks. The findings have applications in feature and method selection, and shed light on network characteristics that make a topological feature effective in link prediction.

Paper Structure (14 sections, 49 equations, 40 figures, 7 tables)

The 21 indexes used in this study and the classification of these indexes
The results associated with Preferential Attachment
Two alternative experiment setups
A topological feature's maximum capability measured by precision
A topological feature's maximum capability measured by AUC-mROC
Extended discussion on the lowest index value
Extended discussion on the scaling of the unsupervised prediction performance
Test on other machine learning algorithms
The optimal score ranking
Another example of feature and index selection in link prediction
The theoretical expression of $p_{1}$ and $p_{2}$
Extended discussion on the prediction performance measured by precision
The analysis for 10% random removal links
The detailed information of empirical dataset

Figures (40)

Figure 1: An illustration of different link prediction performance. (a) Samples in the positive set $L^P$ can be divided into two subsets according to whether they hold the feature. $L_1$ is the subset of $L^P$ in which node pairs hold the feature, whereas the complement set $\overline{L}_1$ is composed of node pairs that do not. Because the index is designed to quantify the feature, it should assign non-zero values to samples in $L_1$ and the value 0 to samples in $\overline{L}_1$. Similarly, the negative set $L^N$ can be divided into two subsets $L_2$ and $\overline{L}_2$. Assume that $L_1$ makes up a fraction $p_1$ of $L^P$ and $L_2$ makes up a fraction $p_2$ of $L^N$. Because samples in $\overline{L}_1$ and $\overline{L}_2$ share the index value 0, the prediction performance mainly relies on the ranking of $L_1$ and $L_2$. (b) The worst index value ranking occurs when $L_2$ is systematically ranked ahead of $L_1$. (c) The best index value ranking is the opposite: $L_1$ is systematically ranked ahead of $L_2$. Note that in both cases, $\overline{L}_1$ and $\overline{L}_2$ are always ranked behind $L_1$ and $L_2$ in the unsupervised approach. (d) In supervised prediction, the machine-learning-based classifier can find a mapping function $y=f(x)$ that transforms the index value into the score used for prediction. Hence, the relative position among $L_1$, $L_2$ and $\overline{L}_1 \cup \overline{L}_2$ can be further optimized. Because samples in $\overline{L}_1 \cup \overline{L}_2$ have the same index value, they must receive the same score. The optimal score ranking assigns $\overline{L}_1 \cup \overline{L}_2$ a score that lies between those of $L_1$ and $L_2$. In this case, no negative sample has a higher score than any positive sample. Note that the scenarios described here can also be used to explain the different precision values obtained (see Supplementary Section S4).
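The three rankings in panels (b), (c) and (d) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the placeholder score values (0, 1, 2) and the sample sizes are assumptions chosen only to reproduce the orderings described in the caption, and ties are assumed to count as half a win.

```python
from itertools import product


def auc(pos_scores, neg_scores):
    """Exact AUC over all (positive, negative) pairs; a tie counts as 0.5."""
    total = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            total += 1.0
        elif p == n:
            total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


def scenario_scores(n_pos, n_neg, p1, p2, scenario):
    """Build scores for L^P and L^N under one ranking from Figure 1.

    The numeric scores are arbitrary placeholders (an assumption);
    only their relative order matters for AUC.
    """
    k1 = round(p1 * n_pos)  # |L_1|: positives that hold the feature
    k2 = round(p2 * n_neg)  # |L_2|: negatives that hold the feature
    if scenario == "worst":         # panel (b): L_2 > L_1 > zero-valued pairs
        s1, s2, zero = 1.0, 2.0, 0.0
    elif scenario == "best":        # panel (c): L_1 > L_2 > zero-valued pairs
        s1, s2, zero = 2.0, 1.0, 0.0
    elif scenario == "supervised":  # panel (d): L_1 > zero-valued pairs > L_2
        s1, s2, zero = 2.0, 0.0, 1.0
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    pos = [s1] * k1 + [zero] * (n_pos - k1)
    neg = [s2] * k2 + [zero] * (n_neg - k2)
    return pos, neg


if __name__ == "__main__":
    for name in ("worst", "best", "supervised"):
        pos, neg = scenario_scores(10, 10, p1=0.6, p2=0.2, scenario=name)
        print(name, auc(pos, neg))
```

With $p_1 = 0.6$ and $p_2 = 0.2$ the three scenarios give AUC values of 0.64, 0.76 and 0.84, so the supervised ordering gains 0.08 over the best unsupervised ordering.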
Figure 2: The scaling of the AUC values. Eq. (\ref{equation:lower}) and Eq. (\ref{equation:upper}) suggest that the actual prediction performance of an index fluctuates within a range of width $p_1 \times p_2$. Therefore, for the common neighbor feature (a) and the path feature (b), whose $p_1 \times p_2$ values are small, the link prediction performance of different indexes roughly scales as $p_{1}-p_{2}$. For each network, we randomly generate 200 realizations of networks with link removal, as well as 200 pairs of $L^P$ and $L^N$ sets (Materials and Methods). The $p_{1}$, $p_{2}$, and corresponding AUC values may vary slightly across the sampled $L^P$ and $L^N$ sets. The figure shows the average values.
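The width $p_1 p_2$ can be recovered by counting pairwise comparisons under the worst and best rankings of Figure 1. The expressions below are a sketch derived from that ranking description, assuming a tie contributes $1/2$; they are not copied from the referenced equations.

$$
\begin{aligned}
\text{AUC}_\text{lower} &= p_1(1-p_2) + \tfrac{1}{2}(1-p_1)(1-p_2) = \tfrac{1}{2} + \tfrac{p_1-p_2}{2} - \tfrac{p_1 p_2}{2},\\
\text{AUC}_\text{upper} &= p_1 + \tfrac{1}{2}(1-p_1)(1-p_2) = \tfrac{1}{2} + \tfrac{p_1-p_2}{2} + \tfrac{p_1 p_2}{2}.
\end{aligned}
$$

Both bounds are centered on $\tfrac{1}{2} + \tfrac{p_1-p_2}{2}$ and differ by $p_1 p_2$, which is why the AUC is approximately linear in $p_1 - p_2$ when $p_1 p_2$ is small.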
Figure 3: The actual and predicted improvement by the supervised approach. Eq. (\ref{equation:upper2}) suggests that the supervised approach can lift the capability of a feature by $(1-p_{1})p_{2}$. To test this, we select networks in which the unsupervised prediction by an index is already close to its upper bound (measured AUC is more than 95% of $\text{AUC}_\text{upper}$). For these networks, we input the same index values of $L^P$ and $L^N$ to the classifier to obtain the supervised prediction results. For each network, we randomly generate 200 realizations of networks with link removal, as well as 200 pairs of $L^P$ and $L^N$ sets (Materials and Methods). We pick the network realization that gives rise to the highest AUC in supervised prediction. $\Delta$ for a given network is measured as the AUC difference between the supervised and unsupervised predictions for that particular network realization. The empirically measured $\Delta$ for different networks and indexes is close to $(1-p_{1})p_{2}$, in line with the prediction of Eq. (\ref{equation:upper2}).
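The lift $(1-p_1)p_2$ follows from a single group of comparisons. The sketch below is reasoned from the rankings in Figure 1 rather than taken from the paper's derivation. Moving from the best unsupervised ranking ($L_1 > L_2 > \overline{L}_1 \cup \overline{L}_2$) to the optimal supervised ranking ($L_1 > \overline{L}_1 \cup \overline{L}_2 > L_2$) changes only the comparisons between a positive sample in $\overline{L}_1$ and a negative sample in $L_2$. These make up a fraction $(1-p_1)p_2$ of all comparisons and turn from losses (contribution 0) into wins (contribution 1):

$$
\Delta = \text{AUC}'_\text{upper} - \text{AUC}_\text{upper} = (1-p_1)\,p_2 \times (1 - 0) = (1-p_1)\,p_2 .
$$

For example, $p_1 = 0.6$ and $p_2 = 0.2$ give $\Delta = 0.4 \times 0.2 = 0.08$.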
Figure 4: The structural characteristics related to the common neighbor feature in link prediction. (a, b) It is intuitively expected that the clustering coefficient $C$ is directly related to the performance of indexes based on the common neighbor feature. However, the unsupervised (a) and supervised (b) prediction results show that $C$ cannot fully explain the effectiveness of the common neighbor feature: the AUC varies considerably at some $C$ values in both cases. (c, d) According to Eq. (\ref{equation:p1}) and Eq. (\ref{equation:p2}), $p_{1}$ depends on the number of closed triangles, and $p_{2}$ depends on the number of open triangles. Therefore, $p_{1}$ should correlate strongly with the clustering coefficient $C$, whereas $p_{2}$ should be independent of $C$; both expectations are confirmed empirically. Here $r$ is the Pearson correlation coefficient, and the p-value is from Student's t-test.
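The link between these quantities and triangles can be illustrated on a toy graph. A linked pair with a common neighbor sits on a closed triangle, while an unlinked pair with a common neighbor forms the two ends of an open triangle. The sketch below computes, for a whole graph, the fraction of linked pairs and the fraction of unlinked pairs that share at least one neighbor. These are only rough proxies for $p_1$ and $p_2$, which the paper defines on the sampled sets $L^P$ and $L^N$ after link removal. The function names and the toy edge list are assumptions made for illustration.

```python
from itertools import combinations


def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbor_fractions(edges):
    """Return (fraction of linked pairs with a common neighbor,
    fraction of unlinked pairs with a common neighbor)."""
    adj = build_adjacency(edges)
    linked = linked_cn = unlinked = unlinked_cn = 0
    for u, v in combinations(sorted(adj), 2):
        shares_neighbor = bool(adj[u] & adj[v])
        if v in adj[u]:
            linked += 1
            linked_cn += shares_neighbor    # pair lies on a closed triangle
        else:
            unlinked += 1
            unlinked_cn += shares_neighbor  # pair spans an open triangle
    closed_frac = linked_cn / linked if linked else 0.0
    open_frac = unlinked_cn / unlinked if unlinked else 0.0
    return closed_frac, open_frac


# Hypothetical toy graph: one triangle (1-2-3) with a tail 3-4-5.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]

if __name__ == "__main__":
    print(common_neighbor_fractions(toy_edges))
```

In the toy graph three of the five links lie on the triangle, and three of the five unlinked pairs (1-4, 2-4, 3-5) share a neighbor, so both fractions are 0.6 even though only the first is tied to closed triangles.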
Figure S1: An example of the link prediction problem. In the common link prediction procedure, a set of existing links is removed randomly from the original network and marked as the positive testing set $L^P$ (the red dashed lines in the training network). As the control group, a random set of node pairs that are not connected in the original network is selected as the negative testing set $L^N$ (the blue dashed lines in the training network). An index considers the topology formed by the remaining links $L^T$ (the black solid lines in the training network) and assigns a value to each node pair in $L^P$ and $L^N$. In unsupervised prediction, the index values are used directly as the scores of samples in $L^P$ and $L^N$. Assume that according to the index value we have $S_{16} = 0.3, S_{34} = 0.58, S_{13} = 0.2$, and $S_{46} = 0.58$. The prediction quality is measured by how well samples in $L^P$ are ranked ahead of those in $L^N$. When using AUC to measure the prediction performance, we usually apply the random sampling approach. In each comparison, we randomly draw a node pair from $L^P$ and a node pair from $L^N$, and compare their scores. Suppose 3 random comparisons are made, selecting node pairs 1-6 and 1-3, node pairs 1-6 and 4-6, and node pairs 3-4 and 4-6. We have one case where the node pair from $L^P$ outscores that from $L^N$, one case where the node pair from $L^N$ scores higher, and one case where the node pairs from $L^P$ and $L^N$ have an equal score. According to Eq. (\ref{equation:auc}) of the main text, the AUC can be estimated as $\frac{1+0.5}{3} = 0.5$. When using precision to measure the prediction performance, we rank node pairs according to their scores in descending order. In the example shown, the rank is 3-4, 4-6, 1-6, 1-3. If we select the hyper-parameter $L_\text{k}=2$, the top-two node pairs (3-4 and 4-6) are considered. As node pair 3-4 is a true positive sample whereas node pair 4-6 is not, the precision is 0.5.
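The arithmetic in this example can be reproduced in a few lines. The sketch below uses the scores and the three sampled comparisons from the caption. The membership $L^P = \{\text{1-6}, \text{3-4}\}$ is inferred from the caption's comparisons rather than stated there, and the function names are illustrative assumptions.

```python
# Index values from the caption; keys are node pairs.
scores = {(1, 6): 0.3, (3, 4): 0.58, (1, 3): 0.2, (4, 6): 0.58}

# Assumption: L^P membership inferred from the caption's comparisons.
positives = {(1, 6), (3, 4)}

# The three sampled comparisons, each as (pair from L^P, pair from L^N).
sampled = [((1, 6), (1, 3)), ((1, 6), (4, 6)), ((3, 4), (4, 6))]


def sampled_auc(scores, comparisons):
    """AUC estimate: (wins + 0.5 * ties) / number of comparisons."""
    wins = sum(scores[p] > scores[n] for p, n in comparisons)
    ties = sum(scores[p] == scores[n] for p, n in comparisons)
    return (wins + 0.5 * ties) / len(comparisons)


def precision_at_k(scores, positives, k):
    """Fraction of the k highest-scored pairs that belong to L^P."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(pair in positives for pair in ranked[:k]) / k


if __name__ == "__main__":
    print(sampled_auc(scores, sampled))          # (1 + 0.5) / 3
    print(precision_at_k(scores, positives, 2))  # 1 of the top 2 is positive
```

Both calls return 0.5, matching the caption. The tie between pairs 3-4 and 4-6 does not affect the precision here because both fall within the top two.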
...and 35 more figures
