Accurate Link Prediction for Edge-Incomplete Graphs via PU Learning

Junghun Kim; Ka Hyun Park; Hoyoung Yoon; U Kang

TL;DR

PULL reframes link prediction on edge-incomplete graphs as a positive-unlabeled learning problem, introducing latent edge variables and an expected graph $\bar{\mathcal{G}}$ to propagate information beyond observed edges. By modeling the joint distribution on a line graph with node potentials tied to predicted linking probabilities and enforcing an EM-like learning signal, PULL iteratively refines both the graph structure and the link predictor. The approach yields state-of-the-art performance across multiple real-world datasets, demonstrates compatibility with several GCNN-based baselines, and scales linearly with graph size. The results highlight the practical impact of accounting for unlabeled edges when predicting missing links in real networks.

Abstract

Given an edge-incomplete graph, how can we accurately find the missing links? Link prediction in edge-incomplete graphs aims to discover the missing relations between entities when their relationships are represented as a graph. Edge-incomplete graphs are prevalent in the real world because of practical limitations; for example, users do not check every other user when adding friends in a social network. Addressing the problem is crucial for various tasks, including recommending friends in social networks and finding references in citation networks. However, previous approaches rely heavily on the given edge-incomplete (observed) graph, which makes it hard to account for the missing (unobserved) links during training. In this paper, we propose PULL (PU-Learning-based Link predictor), an accurate link prediction method based on positive-unlabeled (PU) learning. PULL treats the observed edges in the training graph as positive examples, and the unconnected node pairs as unlabeled ones. PULL prevents the link predictor from overfitting to the observed graph by introducing a latent variable for every edge and leveraging the expected graph structure with respect to those variables. Extensive experiments on five real-world datasets show that PULL consistently outperforms the baselines for predicting links in edge-incomplete graphs.
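
The PU view of the training graph described in the abstract can be illustrated with a minimal sketch. The function name `split_positive_unlabeled` and the edge-list input format are assumptions made for illustration and are not taken from the paper; the sketch also assumes an undirected graph and enumerates all node pairs, which is only practical for small graphs.

```python
from itertools import combinations


def split_positive_unlabeled(num_nodes, observed_edges):
    """Split node pairs of an undirected graph into positive and unlabeled sets.

    Hypothetical helper: observed edges are positives, and every
    unconnected pair is unlabeled (it may or may not be a missing link).
    """
    # Normalize each edge so that (2, 1) and (1, 2) count as the same pair.
    positive = {tuple(sorted(edge)) for edge in observed_edges}
    # Every remaining pair i < j is unlabeled rather than negative.
    unlabeled = [pair for pair in combinations(range(num_nodes), 2)
                 if pair not in positive]
    return sorted(positive), unlabeled


positive, unlabeled = split_positive_unlabeled(4, [(0, 1), (2, 1)])
print(positive)
print(unlabeled)
```

The key point is that no pair is labeled negative: a standard link predictor would treat the unlabeled list as non-edges, whereas the PU setting leaves their status open.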

Paper Structure (37 sections, 3 theorems, 16 equations, 4 figures, 7 tables, 1 algorithm)

Introduction
Related Works
Link Prediction in Graphs
Graph-based PU Learning
Proposed Method
Modeling Missing Links (C1)
Expectation of Graph Structure (C2)
Designing the joint probability.
Computing the expectation of graph.
Approximating the expected graph.
Iterative Learning of Link Predictor (C3)
Theoretical Analysis
Relation of PULL to EM algorithm.
Complexity of PULL.
Experiments
...and 22 more sections

Key Result

Theorem 1

Given the assumption in Equation (eq:theory-2), the likelihood $Q(\theta^\mathrm{new} \mid \theta)$ of the EM algorithm reduces to the negative of the loss $\mathcal{L}_E$ of PULL with the expected graph $\bar{\mathcal{G}}$, where $\hat{y}_{ij}$ is the estimated linking probability between nodes $i$ and $j$ by $f_{\theta^\mathrm{new}}$, and $\mathbf{A}^{\bar{\mathcal{G}}}$ …

Figures (4)

Figure 1: Main challenge of previous works. They cannot consider the hidden unobserved edges in the given graph. PULL treats the unconnected node pairs as unlabeled examples, and utilizes the expectation of graph structure.
Figure 2: Overall structure of PULL. Given an edge-incomplete graph $\mathcal{G}_\mathcal{P}$ with a set $\mathcal{P}$ of observed edges, PULL first computes the expected graph structure $\bar{\mathcal{G}}$ by proposing latent variables for the edges. Then PULL utilizes $\bar{\mathcal{G}}$ to update the link predictor $f$. The marginal linking probabilities $\hat{y}$ obtained by the updated $f$ are used to compute $\bar{\mathcal{G}}$ in the next iteration.
Figure 3: AUC score of PULL and PULL-$\mathcal{L}_C$ through the iterations. PULL-$\mathcal{L}_C$ represents PULL without $\mathcal{L}_C$. The dashed gray lines denote the ground-truth numbers of edges. The accuracy of PULL increases in early iterations, and converges or slightly increases once the number $K$ of sampled edges exceeds the ground-truth number. This shows that PULL improves the quality of the expected graph with each iteration. Moreover, PULL consistently outperforms PULL-$\mathcal{L}_C$. In PubMed and Crocodile, the accuracy of PULL-$\mathcal{L}_C$ drops rapidly after $K$ exceeds the dashed gray lines. This demonstrates that $\mathcal{L}_C$ protects PULL from performance degradation when the expected graph structure has more edges than the actual graph.
Figure 4: The running time of PULL on sampled subgraphs. The time increases linearly with the number of edges.
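
The alternation described in the Figure 2 caption (compute an expected graph from the current predictions, then update the predictor on it) might be sketched as below. This is a toy illustration under several assumptions that are not from the paper: the predictor is a dot product of free node embeddings rather than a graph neural network, the expected graph keeps observed edges at weight 1 plus the `k` most probable unobserved pairs at their predicted probability, `k` grows linearly with the iteration count, and the additional loss $\mathcal{L}_C$ is omitted. All function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def expected_adjacency(y_hat, observed, k):
    """Assumed approximation: observed edges get weight 1, and the k most
    probable unobserved pairs keep their predicted probability."""
    n = y_hat.shape[0]
    a_bar = np.zeros((n, n))
    rows, cols = np.triu_indices(n, k=1)
    # Exclude observed pairs from the ranking by giving them score -inf.
    scores = np.where(observed[rows, cols], -np.inf, y_hat[rows, cols])
    top = np.argsort(-scores)[:k]
    top = top[np.isfinite(scores[top])]
    a_bar[rows[top], cols[top]] = scores[top]
    a_bar = np.maximum(a_bar, a_bar.T)  # symmetrize
    a_bar[observed] = 1.0
    return a_bar


def fit_predictor(z, a_bar, steps=200, lr=0.1):
    """Gradient descent on cross-entropy between sigmoid(z z^T) and a_bar."""
    for _ in range(steps):
        grad_logits = sigmoid(z @ z.T) - a_bar
        np.fill_diagonal(grad_logits, 0.0)  # ignore self-pairs
        z = z - lr * (grad_logits @ z)
    return z


def pull_like_loop(observed, dim=4, iterations=3, seed=0):
    rng = np.random.default_rng(seed)
    n = observed.shape[0]
    z = 0.1 * rng.standard_normal((n, dim))
    num_observed = int(observed.sum() // 2)
    for t in range(1, iterations + 1):
        y_hat = sigmoid(z @ z.T)
        # Hypothetical schedule: sample more candidate edges each round.
        a_bar = expected_adjacency(y_hat, observed, k=t * num_observed)
        z = fit_predictor(z, a_bar)
    return sigmoid(z @ z.T), a_bar


observed = np.zeros((5, 5), dtype=bool)
for i, j in [(0, 1), (1, 2), (3, 4)]:
    observed[i, j] = observed[j, i] = True
y_hat, a_bar = pull_like_loop(observed)
```

The soft targets in `a_bar` are what distinguish this loop from training on the observed graph alone: unconnected pairs that the current predictor finds plausible are no longer pushed toward zero.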

Theorems & Definitions (6)

Theorem 1
Theorem 2
Lemma 1
proof
proof
proof
