Inconsistency of evaluation metrics in link prediction

Yilin Bi; Xinshan Jiao; Yan-Li Lee; Tao Zhou

TL;DR

This paper tackles the problem that evaluation metrics for link-prediction algorithms can yield inconsistent algorithm rankings. Through a large-scale study across 340 real networks and 25 algorithms, it reveals substantial inconsistency among common metrics, with only AUPR, AUC-Precision, and NDCG forming a tight cluster, while AUC-mROC remains relatively independent. It proves that for a fixed threshold $k$, threshold-dependent metrics are rank-equivalent, and offers practical guidance to use AUC plus one of AUPR, AUC-Precision, or NDCG, reserving threshold-dependent metrics for problem-specific thresholds and recommending AUC-mROC only when very few top predictions matter. The work provides four actionable guidelines for metric selection and releases data and code to enable fair, reproducible benchmarking in link prediction, aiming to standardize evaluation practices in the field.

Abstract

Link prediction is a paradigmatic and challenging problem in network science, which aims to predict missing links, future links, and temporal links based on known topology. As the number of link prediction algorithms grows, a critical yet previously ignored risk is that the evaluation metrics for algorithm performance are usually chosen at will. This paper reports extensive experiments on hundreds of real networks and 25 well-known algorithms, revealing significant inconsistency among evaluation metrics: different metrics can produce remarkably different rankings of algorithms. We therefore conclude that no single metric can comprehensively or credibly evaluate algorithm performance. Further analysis suggests using at least two metrics: one is the area under the receiver operating characteristic curve (AUC), and the other is one of the following three candidates: the area under the precision-recall curve (AUPR), the area under the precision curve (AUC-Precision), or the normalized discounted cumulative gain (NDCG). In addition, because we have proved the essential equivalence of threshold-dependent metrics, if specific thresholds are meaningful in a link prediction task, any one threshold-dependent metric can be used with those thresholds. This work completes a missing part in the landscape of link prediction, and provides a starting point toward a well-accepted criterion or standard to select proper evaluation metrics for link prediction.
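The equivalence claim for threshold-dependent metrics can be illustrated with a minimal sketch. It assumes that, at a fixed threshold $k$, each metric depends only on the number of probe links found among the top-$k$ predictions. Precision@$k$ appears in the paper; Recall@$k$ and F1@$k$ are chosen here purely for illustration, and the function names, hit counts, and sizes below are hypothetical.

```python
def threshold_metrics(hits: int, k: int, n_probe: int) -> dict:
    """Precision@k, Recall@k and F1@k given the number of probe links in the top-k."""
    precision = hits / k
    recall = hits / n_probe
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def ranking(scores: dict) -> list:
    """Algorithm names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical hit counts for four algorithms at the same threshold k.
hits_by_algorithm = {"A1": 12, "A2": 30, "A3": 7, "A4": 21}
k, n_probe = 50, 100  # illustrative values, not taken from the paper

rankings = {
    name: ranking(
        {alg: threshold_metrics(h, k, n_probe)[name] for alg, h in hits_by_algorithm.items()}
    )
    for name in ("precision", "recall", "f1")
}
print(rankings)
```

With $k$ and the probe-set size fixed, all three quantities increase with the hit count, so they order the algorithms identically. This is a toy check on made-up numbers, not the paper's proof.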

Paper Structure (15 sections, 23 equations, 9 figures, 1 table)

Introduction
Results
Inconsistency among Metrics
Quandary of Threshold-dependent Metrics
Correlation Graph Analysis
Discussion
Materials and Methods
Algorithms of Link Prediction
Evaluation Metrics
Equivalence of Threshold-dependent Metrics
Ranking Correlation Coefficients
Data and Codes
Supplemental Information
Sensitivity Analysis
The Alternative Method

Figures (9)

Figure 1: Schematic flowchart of the proposed method to measure the correlation between any two evaluation metrics $M_1$ and $M_2$. (A) Initially, the original network is divided into training and probe sets at a given ratio, for example 9:1. Next, the evaluation scores of different algorithms $A_{i} (i=1, 2,\dots, P)$ (here we show an example for $P=5$) are calculated by $M_{1}$ and $M_{2}$. The average scores are obtained over multiple runs with different random divisions into training and probe sets. Based on the average scores, we get two rankings of algorithms corresponding to $M_1$ and $M_2$, respectively. (B) We select a large number of real-world networks $G_1, G_2, \cdots, G_Q$, and for each network $G_i$ and each metric $M_j$, we obtain a ranking of the $P$ algorithms (here we show an example for $Q=3$). (C) We calculate the correlation coefficient of $M_{1}$ and $M_{2}$ by applying a ranking correlation coefficient (e.g., the Spearman correlation coefficient [Spearman1987, wangpei2020] or Kendall's $\tau$ correlation coefficient [Kendall1938, wangpei2020]) and averaging over the $Q$ selected networks.
Figure 2: The trend of correlations between metrics as $Q$ increases. For each $Q$, we implement 10 independent runs, where in each run we randomly select $Q$ networks from the collection of 340 real networks. Here the threshold for Precision is set as $k=0.1 \cdot|U-E^{T}|$.
Figure 3: The change of correlations between Precision@$k$ and threshold-free metrics for varying $k$. In the main plots, we set $k=\rho|U-E^{T}|$, and in the insets, we set $k=\gamma|E^P|$. The average Spearman rank correlation coefficients correspond to $Q=300$. (A)-(E) respectively show the cases for AUC, AUPR, AUC-Precision, NDCG, and AUC-mROC.
Figure 4: The Spearman rank correlation coefficients for all metric pairs, averaged over 10 independent runs and 300 selected networks in each run. For Precision, the threshold is set as $k=0.1\cdot |U-E^{T}|$. The top-right corner shows the corresponding correlation graph, with the thickness of each link representing the strength of correlation.
Figure 5: The average pairwise correlations over 300 randomly selected real networks for different splitting ratios of $|E^T|$ to $|E^P|$. The blue, red, green, and purple lines represent the results for $|E^T|:|E^P|=6:4$, $|E^T|:|E^P|=7:3$, $|E^T|:|E^P|=8:2$, and $|E^T|:|E^P|=9:1$, respectively.
...and 4 more figures
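
The procedure in the Figure 1 caption (score each algorithm under two metrics, then correlate the two rankings per network and average over networks) can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions: the per-network average scores are already computed, Spearman's coefficient is taken as the Pearson correlation of tie-averaged ranks, and all names and numbers are hypothetical rather than taken from the released code.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks with the highest value ranked 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks.

    Raises ZeroDivisionError if either input is constant (all ranks tied).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def metric_correlation(scores_m1, scores_m2):
    """scores_m[q][p]: average score of algorithm p on network q under metric M."""
    return mean(spearman(a, b) for a, b in zip(scores_m1, scores_m2))


# Hypothetical average scores for Q=3 networks and P=5 algorithms.
scores_m1 = [[0.9, 0.8, 0.7, 0.6, 0.5],
             [0.5, 0.6, 0.7, 0.8, 0.9],
             [0.9, 0.8, 0.7, 0.6, 0.5]]
scores_m2 = [[0.30, 0.25, 0.20, 0.15, 0.10],
             [0.30, 0.25, 0.20, 0.15, 0.10],
             [0.25, 0.30, 0.20, 0.15, 0.10]]
correlation = metric_correlation(scores_m1, scores_m2)
print(correlation)
```

In this toy input the two metrics agree fully on the first network, disagree fully on the second, and differ by one swap on the third, so the averaged coefficient lands well below 1 even though two of three networks show strong agreement.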

Theorems & Definitions (1)

proof
