Attribute-Enhanced Similarity Ranking for Sparse Link Prediction

João Mattos, Zexi Huang, Mert Kosan, Ambuj Singh, Arlei Silva

TL;DR

Gelato tackles link prediction in sparse graphs by reframing it as a similarity-ranking problem rather than binary classification. It integrates node attributes into topology through a lightweight graph-learning step, applies Autocovariance as a global topological heuristic, and trains with an N-pair ranking loss using partitioned negative sampling to expose hard negatives. Across four datasets, Gelato consistently outperforms state-of-the-art GNN-based methods under unbiased testing and scales efficiently thanks to sparse computations. The work also scrutinizes evaluation practices, arguing that unbiased, rank-based metrics are essential for meaningful assessment in sparse graphs.
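To make the ranking objective concrete, here is a minimal sketch of an N-pair style loss over similarity scores. It assumes the common formulation $\log\bigl(1 + \sum_j \exp(s_j^- - s^+)\bigr)$ for one positive pair with score $s^+$ and a set of negative pairs with scores $s_j^-$. This summary does not state the exact form Gelato uses, and the function name `n_pair_loss` is hypothetical.

```python
import math


def n_pair_loss(pos_score, neg_scores):
    """N-pair style ranking loss for one positive pair (assumed formulation).

    Computes log(1 + sum_j exp(neg_j - pos)) with a log-sum-exp shift so
    that large score gaps do not overflow.
    """
    # The leading 0.0 is the logit of the "+1" term inside the logarithm.
    logits = [0.0] + [s - pos_score for s in neg_scores]
    shift = max(logits)
    return shift + math.log(sum(math.exp(z - shift) for z in logits))


# The loss shrinks as the positive pair pulls ahead of its negatives.
close_call = n_pair_loss(1.0, [0.9, 0.8])
clear_win = n_pair_loss(5.0, [0.9, 0.8])
print(close_call > clear_win)
```

Because every negative contributes to the same sum, the negatives whose scores sit closest to the positive dominate the loss, which is why a sampling scheme that surfaces such pairs matters.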

Abstract

Link prediction is a fundamental problem in graph data. In its most realistic setting, the problem consists of predicting missing or future links between random pairs of nodes from the set of disconnected pairs. Graph Neural Networks (GNNs) have become the predominant framework for link prediction. GNN-based methods treat link prediction as a binary classification problem and handle the extreme class imbalance -- real graphs are very sparse -- by sampling (uniformly at random) a balanced number of disconnected pairs not only for training but also for evaluation. However, we show that the reported performance of GNNs for link prediction in the balanced setting does not translate to the more realistic imbalanced setting and that simpler topology-based approaches are often better at handling sparsity. These findings motivate Gelato, a similarity-based link-prediction method that applies (1) graph learning based on node attributes to enhance a topological heuristic, (2) a ranking loss for addressing class imbalance, and (3) a negative sampling scheme that efficiently selects hard training pairs via graph partitioning. Experiments show that Gelato outperforms existing GNN-based alternatives.
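The gap between balanced and imbalanced evaluation can be checked with a short calculation on the scenario from the Figure 5 caption, where a model ranks 1M false positives above the 100k true edges. The sketch assumes the true edges occupy the ranks immediately after that block of false positives; the function name is hypothetical.

```python
def average_precision_positives_ranked_last(num_false_positives, num_true_edges):
    """Average precision when all true edges follow a block of false positives.

    The i-th true edge sits at rank num_false_positives + i, so the precision
    measured at that rank is i / (num_false_positives + i).
    """
    total = 0.0
    for i in range(1, num_true_edges + 1):
        total += i / (num_false_positives + i)
    return total / num_true_edges


ap = average_precision_positives_ranked_last(1_000_000, 100_000)
print(round(ap, 3))  # 0.047
```

Under this assumption the result rounds to the 0.05 unbiased AP reported in the caption. Negatives ranked below the true edges do not affect the value, so AP over the full ranking exposes the 1M errors no matter how many easy negatives the graph contains.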

Paper Structure

This paper contains 32 sections, 5 theorems, 18 equations, 13 figures, 12 tables.

Key Result

Lemma 1

The ratio $\alpha$ between inter-cluster and intra-cluster negative node pairs in the SBM is such that:

Figures (13)

  • Figure 1: Gelato applies graph learning to incorporate attribute information into the topology. The learned graph is given to a topological heuristic that predicts edges between node pairs with high Autocovariance similarity. The parameters of the MLP are optimized end-to-end using the N-pair loss over node pairs selected via a partitioning-based negative sampling scheme. Experiments show that Gelato outperforms state-of-the-art GNN-based link prediction methods.
  • Figure 2: Scaling up Gelato using batching and sparse tensors. We represent sparse tensors (1 and 2) as matrices with blank entries and dense tensors (3 and 4) as color-filled matrices. Given a batch of node indices $V_{batch}$, we extract a slice $P_0$ (2) from the enhanced transition matrix (1). Instead of a matrix exponentiation, we compute the product $P_0(\widetilde{D}^{-1}\widetilde{A})$ $t$ times in succession to obtain $P_k$ (3), a dense tensor. Finally, we use $P_k$ to obtain the autocovariance $R$ (4) for nodes in the batch. This is implemented efficiently using dense-sparse tensor multiplication.
  • Figure 3: We analyze classification-based and similarity-based link prediction approaches through a comparison between the probability density functions of predicted similarities/scores by Gelato and NCN (state-of-the-art GNN), on the test set in three different regimes (biased, unbiased, and partitioned). Negative pairs are represented in red, and positive pairs are represented in blue. By treating link prediction as a similarity-based problem, Gelato presents better separation (smaller overlap) between the similarity curves in the harder scenarios, distinguishing between positive and negative pairs across all testing regimes. NCN presents a drastic increase in overlap as negative pairs become harder, struggling to separate positive and negative pairs.
  • Figure 4: Link prediction comparison in terms of hits@$k$ for varying $k$ on Cora, CiteSeer, OGBL-DDI and OGBL-Collab. All datasets were split using unbiased sampling, except OGBL-Collab, which was split using partitioned sampling. Gelato outperforms the baselines across different values of $k$ and remains competitive on OGBL-DDI, a dataset in which all methods struggle.
  • Figure 5: Receiver operating characteristic and precision-recall curves for the bad link prediction model that ranks 1M false positives higher than the 100k true edges. The model achieves 0.99 AUC and 0.95 Average Precision (AP) under biased testing, while the more informative performance evaluation metric, AP under unbiased testing, is only 0.05.
  • ...and 8 more figures
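
The batching scheme in the Figure 2 caption can be sketched in plain Python. The sketch makes several assumptions. The sparse matrix is stored as a dict of rows. The initial slice is followed by `num_products` further products, so the result holds rows of the transition matrix raised to the power `num_products + 1`. The autocovariance is taken to be the standard random-walk form $R = \Pi P^k - \pi\pi^\top$ with $\pi = d / \mathrm{vol}$, which the caption does not spell out. All function names are hypothetical, and a real implementation would use a sparse tensor library.

```python
def transition_rows(adj):
    """Row-normalize a weighted adjacency dict {i: {j: w}} into D^{-1} A."""
    rows = {}
    for i, nbrs in adj.items():
        deg = sum(nbrs.values())
        rows[i] = {j: w / deg for j, w in nbrs.items()}
    return rows


def dense_times_sparse(dense_rows, sparse):
    """Multiply dense row vectors by a sparse matrix stored as {i: {j: value}}."""
    n = len(dense_rows[0])
    out = [[0.0] * n for _ in dense_rows]
    for r, x in enumerate(dense_rows):
        for i, xi in enumerate(x):
            if xi == 0.0:
                continue  # skip zero entries of the dense row
            for j, pij in sparse.get(i, {}).items():
                out[r][j] += xi * pij
    return out


def batched_autocovariance(adj, batch, num_products):
    """Autocovariance rows for the nodes in `batch` (nodes are 0..n-1)."""
    n = len(adj)
    P = transition_rows(adj)
    degree = [sum(adj[i].values()) for i in range(n)]
    vol = sum(degree)
    pi = [d / vol for d in degree]  # stationary distribution (assumed form)
    # Slice of the transition matrix for the batch, stored densely.
    rows = [[P[i].get(j, 0.0) for j in range(n)] for i in batch]
    # Repeated dense-sparse products replace an explicit matrix power.
    for _ in range(num_products):
        rows = dense_times_sparse(rows, P)
    return [
        [pi[i] * row[j] - pi[i] * pi[j] for j in range(n)]
        for i, row in zip(batch, rows)
    ]


# Path graph 0 - 1 - 2 with unit edge weights.
path = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0}}
R_batch = batched_autocovariance(path, batch=[0], num_products=1)
print(R_batch)  # [[0.0625, -0.125, 0.0625]]
```

Only the batch rows are ever stored densely, so memory grows with the batch size times the number of nodes rather than with the square of the number of nodes.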

Theorems & Definitions (5)

  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Lemma 3
  • Lemma 4