The maximum capability of a topological feature in link prediction
Yijun Ran, Xiao-Ke Xu, Tao Jia
TL;DR
The paper proves a universal upper bound on the predictive capability of any topological feature for link prediction. The bound depends only on two quantities: the fraction $p_1$ of missing links and the fraction $p_2$ of nonexistent links that carry the feature. All indexes within a feature family share this upper bound, while supervised learning can lift the bound by $\Delta=(1-p_1)p_2$, yielding $\text{AUC}'_{upper}$ (and a corresponding precision analogue). The authors validate the theory on 550 networks and provide explicit expressions for $p_1$ and $p_2$ for the common neighbor feature, linking them to motif counts such as closed and open triangles. The framework offers practical guidance for feature and method selection and shows that network structure beyond clustering (e.g., open-triangle motifs) strongly influences feature effectiveness. Overall, the results provide a principled, quantitative tool for assessing and optimizing topological features in link prediction.
Abstract
Networks offer a powerful approach to modeling complex systems by representing the underlying set of pairwise interactions. Link prediction is the task of predicting links of a network that are not directly visible, with profound applications in biological, social, and other complex systems. Despite the intensive use of topological features in this task, it is unclear to what extent a feature can be leveraged to infer missing links. Here, we aim to unveil the capability of a topological feature in link prediction by identifying the upper bound of its prediction performance. We introduce a theoretical framework that is compatible with different indexes to gauge the feature, different prediction approaches to utilize the feature, and different metrics to quantify the prediction performance. The maximum capability of a topological feature follows a simple yet theoretically validated expression, which depends only on the extent to which the feature is held in missing and nonexistent links. Because a family of indexes based on the same feature shares the same upper bound, the potential of all others can be estimated from a single index. Furthermore, a feature's capability is lifted in supervised prediction, and the lift can be mathematically quantified, allowing us to estimate the benefit of applying machine learning algorithms. The universality of the pattern uncovered is empirically verified on 550 structurally diverse networks. The findings have applications in feature and method selection, and shed light on network characteristics that make a topological feature effective in link prediction.
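As a rough illustration of the two quantities the bound depends on, the sketch below estimates $p_1$ and $p_2$ for the common neighbor feature on a toy network and evaluates the supervised lift $\Delta=(1-p_1)p_2$ quoted in the TL;DR. Several things here are assumptions rather than the authors' procedure: a node pair is taken to "carry the feature" when it has at least one common neighbor in the observed network, $p_1$ is read as the fraction for missing links and $p_2$ as the fraction for nonexistent links, and the function names and the toy graph are hypothetical.

```python
from itertools import combinations


def has_common_neighbor(adj, u, v):
    """Assumed feature indicator: u and v share at least one neighbor."""
    return not adj[u].isdisjoint(adj[v])


def fraction_with_feature(adj, pairs):
    """Fraction of node pairs that carry the feature (0.0 for an empty list)."""
    if not pairs:
        return 0.0
    return sum(has_common_neighbor(adj, u, v) for u, v in pairs) / len(pairs)


def feature_fractions(nodes, observed_edges, missing_edges):
    """Return (p1, p2) for the common neighbor feature.

    p1: fraction of missing links that carry the feature.
    p2: fraction of nonexistent links that carry the feature.
    The feature is evaluated on the observed network only.
    """
    adj = {n: set() for n in nodes}
    for u, v in observed_edges:
        adj[u].add(v)
        adj[v].add(u)

    observed = {tuple(sorted(e)) for e in observed_edges}
    missing = sorted({tuple(sorted(e)) for e in missing_edges})
    missing_set = set(missing)

    # Nonexistent links: pairs that are neither observed nor missing.
    nonexistent = [
        pair
        for pair in combinations(sorted(nodes), 2)
        if pair not in observed and pair not in missing_set
    ]
    return fraction_with_feature(adj, missing), fraction_with_feature(adj, nonexistent)


def supervised_lift(p1, p2):
    """Lift of the upper bound under supervised prediction, as stated in the TL;DR."""
    return (1 - p1) * p2


# Toy example (hypothetical data): a path 0-1-2-3-4-5 with two held-out links.
nodes = [0, 1, 2, 3, 4, 5]
observed_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
missing_edges = [(0, 2), (0, 5)]

p1, p2 = feature_fractions(nodes, observed_edges, missing_edges)
print(p1, p2, supervised_lift(p1, p2))  # 0.5 0.375 0.1875
```

In this toy case one of the two missing links and three of the eight nonexistent links have a common neighbor, so $p_1=0.5$, $p_2=0.375$, and $\Delta=0.1875$. The sketch enumerates all node pairs, which is only practical for small networks.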
