Revisiting Link Prediction: A Data Perspective

Haitao Mao, Juanhui Li, Harry Shomer, Bingheng Li, Wenqi Fan, Yao Ma, Tong Zhao, Neil Shah, Jiliang Tang

TL;DR

This paper unearths relationships among three factors critical to link prediction (local structural proximity, global structural proximity, and feature proximity): (i) global structural proximity is effective only when local structural proximity is deficient, and (ii) feature proximity and structural proximity can be incompatible.

Abstract

Link prediction, a fundamental task on graphs, has proven indispensable in various applications, e.g., friend recommendation, protein analysis, and drug interaction prediction. However, since datasets span a multitude of domains, they could have distinct underlying mechanisms of link formation. Evidence in existing literature underscores the absence of a universally best algorithm suitable for all datasets. In this paper, we endeavor to explore principles of link prediction across diverse datasets from a data-centric perspective. We recognize three fundamental factors critical to link prediction: local structural proximity, global structural proximity, and feature proximity. We then unearth relationships among those factors: (i) global structural proximity is effective only when local structural proximity is deficient, and (ii) feature proximity and structural proximity can be incompatible. Such incompatibility leads to GNNs for Link Prediction (GNN4LP) consistently underperforming on edges where the feature proximity factor dominates. Inspired by these new insights from a data perspective, we offer practical instruction for GNN4LP model design and guidelines for selecting appropriate benchmark datasets for more comprehensive evaluations.
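
To make the three factors concrete, the sketch below scores node pairs on a toy graph with one heuristic per factor. The specific heuristics are illustrative assumptions, not taken from the abstract: common-neighbor count for local structural proximity, the Katz index for global structural proximity, and cosine similarity of node features for feature proximity. The toy graph, the feature matrix `x`, and the damping value `beta` are likewise hypothetical.

```python
import numpy as np

def common_neighbors(adj, i, j):
    """Local structural proximity: number of neighbors shared by i and j."""
    return int(np.dot(adj[i], adj[j]))

def katz_index(adj, beta=0.05):
    """Global structural proximity: Katz index, sum over l >= 1 of beta^l * A^l.

    Uses the closed form (I - beta * A)^{-1} - I, which is valid only when
    beta is smaller than 1 / (spectral radius of A).
    """
    n = adj.shape[0]
    identity = np.eye(n)
    return np.linalg.inv(identity - beta * adj) - identity

def feature_cosine(x, i, j):
    """Feature proximity: cosine similarity between node feature vectors."""
    denom = np.linalg.norm(x[i]) * np.linalg.norm(x[j])
    return float(x[i] @ x[j] / denom) if denom > 0 else 0.0

# Hypothetical undirected toy graph on 5 nodes.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
adj = np.zeros((5, 5))
for u, v in edges:
    adj[u, v] = adj[v, u] = 1.0

# Hypothetical 2-dimensional node features.
x = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]])

katz = katz_index(adj)

for i, j in [(0, 3), (0, 4)]:
    print(
        f"pair ({i}, {j}): CN={common_neighbors(adj, i, j)}, "
        f"Katz={katz[i, j]:.6f}, cos={feature_cosine(x, i, j):.3f}"
    )
```

In this toy graph, nodes 0 and 4 share no neighbors, so the local heuristic gives no signal, while the Katz index still assigns the pair a small positive score through the longer path. That is the kind of case in which a global heuristic can add information.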


Paper Structure

This paper contains 42 sections, 15 theorems, 33 equations, 12 figures, 11 tables.

Key Result

Proposition 1

For any $\delta>0$, with probability at least $1-2\delta$, we have $d_{ij} \leq 2 \sqrt{r^{\max}_{ij}-\left(\frac{\eta_{ij} / N-\epsilon}{V(1)}\right)^{2 / D}}$, where $\eta_{ij}$ is the number of common neighbors between nodes $i$ and $j$, $r^{\max}_{ij} = \max\{r_{i},r_{j}\}$, and $V(1)$ is the

Figures (12)

  • Figure 1: Distribution disparity of Common Neighbors across datasets.
  • Figure 2: Performance of heuristics corresponding to different factors.
  • Figure 3: Overlapping ratio between the top-ranked edges of different heuristic algorithms. Diagonal entries compare two heuristics within the same factor, while off-diagonal entries compare heuristics from different factors. FP is omitted on ogbl-ddi and ogbl-ppa because their node features are absent or of weak quality. MRR is selected as the metric. More results on the hit@10 metric can be found in Appendix \ref{app:more-dataset}.
  • Figure 4: Performance comparison between GNN4LP models and SAGE on the ogbl-collab dataset. Bars represent the performance gap on node pairs dominated by feature proximity and by structural proximity, respectively. The two panels compare FP with GSP and FP with LSP, respectively.
  • Figure 5: The original SEAL and the proposed decoupled SEAL architectures. $\mathbf{X}_{\text{feat}}$ and $\mathbf{X}_{\text{drnl}}$ are the original node feature and the structural embedding via Double-Radius Node Labeling.
  • ...and 7 more figures

Theorems & Definitions (17)

  • Proposition 1: latent space distance bound with CNs
  • Proposition 2: latent space distance bound with the number of paths
  • Proposition 3: latent space distance bound with feature proximity
  • Lemma 1: latent space distance bound with local and global structural proximity
  • Lemma 2: Incompatibility between LSP and FP factors
  • Definition 1: Latent space model for link prediction
  • Proposition 1: latent space distance bound with CNs
  • Proposition 2: latent space distance bound with the number of paths
  • Definition 1
  • Lemma 1
  • ...and 7 more