Neural Common Neighbor with Completion for Link Prediction

Xiyuan Wang, Haotong Yang, Muhan Zhang

TL;DR

The paper tackles link prediction by addressing a fundamental limitation of standard MPNNs: symmetric node representations can obscure pairwise relations between target nodes. It introduces an MPNN-then-SF architecture, instantiated as Neural Common Neighbor (NCN), which blends learnable node representations with structural features derived from common neighbors to boost expressivity and scalability. Recognizing that real graphs are often incomplete, it analyzes how incomplete data biases common-neighbor signals and proposes Common Neighbor Completion (CNC) followed by Neural Common Neighbor with Completion (NCNC) to mitigate this issue. Empirical results across seven real-world benchmarks show that NCN and NCNC achieve state-of-the-art performance with favorable efficiency, highlighting the practical impact for scalable link prediction under imperfect data conditions.

Abstract

In this work, we propose a novel link prediction model and further boost it by studying graph incompleteness. First, we introduce MPNN-then-SF, an innovative architecture leveraging structural feature (SF) to guide MPNN's representation pooling, with its implementation, namely Neural Common Neighbor (NCN). NCN exhibits superior expressiveness and scalability compared with existing models, which can be classified into two categories: SF-then-MPNN, augmenting MPNN's input with SF, and SF-and-MPNN, decoupling SF and MPNN. Second, we investigate the impact of graph incompleteness -- the phenomenon that some links are unobserved in the input graph -- on SF, like the common neighbor. Through dataset visualization, we observe that incompleteness reduces common neighbors and induces distribution shifts, significantly affecting model performance. To address this issue, we propose to use a link prediction model to complete the common neighbor structure. Combining this method with NCN, we propose Neural Common Neighbor with Completion (NCNC). NCN and NCNC outperform recent strong baselines by large margins, and NCNC further surpasses state-of-the-art models in standard link prediction benchmarks. Our code is available at https://github.com/GraphPKU/NeuralCommonNeighbor.
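
The pooling idea behind MPNN-then-SF can be sketched on a toy graph in the spirit of Figure 3. This is a minimal illustration, not the authors' implementation: the node vectors in `h` are hand-picked stand-ins for MPNN outputs, the helper names `common_neighbors` and `pair_representation` are hypothetical, and the choice to concatenate an elementwise product with a sum over common neighbors is an assumption about how the pooling might be realized.

```python
# Toy sketch of MPNN-then-SF pooling; names and values are illustrative only.

def common_neighbors(adj, i, j):
    """Return the sorted common neighbors of nodes i and j."""
    return sorted(adj[i] & adj[j])


def pair_representation(h, adj, i, j):
    """Concatenate the elementwise product of h[i] and h[j] with the sum of
    the representations of the common neighbors of (i, j)."""
    product = [a * b for a, b in zip(h[i], h[j])]
    pooled = [0.0] * len(h[i])
    for u in common_neighbors(adj, i, j):
        pooled = [p + x for p, x in zip(pooled, h[u])]
    return product + pooled


# Nodes: 0 = v1, 1 = v2, 2 = v3, 3 and 4 = the two intermediate nodes.
adj = {0: {3, 4}, 1: {3}, 2: {4}, 3: {0, 1}, 4: {0, 2}}

# Hand-picked vectors (assumed, not computed by an MPNN): v2 and v3 share a
# representation, while the two intermediate nodes differ.
h = {
    0: [1.0, 1.0],
    1: [1.0, 0.0],
    2: [1.0, 0.0],
    3: [0.0, 1.0],
    4: [2.0, 0.0],
}

rep_12 = pair_representation(h, adj, 0, 1)
rep_13 = pair_representation(h, adj, 0, 2)

# Both links have exactly one common neighbor, so a count-based structural
# feature cannot separate them, but the pooled representations differ.
print(len(common_neighbors(adj, 0, 1)), len(common_neighbors(adj, 0, 2)))
print(rep_12 != rep_13)
```

In this toy setting the common-neighbor count is identical for both candidate links, so only the pooled neighbor representations carry the distinguishing signal.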

Paper Structure (44 sections, 4 theorems, 19 equations, 8 figures, 9 tables)

Key Result

Theorem 1

The combination of Equation equ:GenPairwise2 and Equation equ:GenPairwise3 is strictly more expressive than the MPNN-only model GAE, the SF-only models CN, RA, and AA, and the SF-and-MPNN models Neo-GNN and BUDDY.

Figures (8)

  • Figure 1: The failure of MPNN in the link prediction task. $v_2$ and $v_3$ have equal MPNN node representations due to symmetry. However, because their pairwise relations differ, $(v_1, v_2)$ and $(v_1, v_3)$ should have different representations.
  • Figure 2: Architectures for combining SF and MPNN. $A$ denotes the input graph structure. Existing works use the (1) SF-then-MPNN and (2) SF-and-MPNN architectures. We propose a completely new architecture, (3) MPNN-then-SF.
  • Figure 3: White, green, and yellow colors represent node features $0$, $1$, and $2$, respectively. Both links $(v_1, v_2)$ and $(v_1, v_3)$ have one common neighbor, making them indistinguishable for existing SF-and-MPNN models. However, NCN can differentiate between them because the two common neighbors have different features.
  • Figure 4: Visualization of incompleteness on datasets. The incomplete graph contains only the edges in the training set, and the complete graph additionally contains the edges in the validation and test sets. (a) and (b) visualize the ogbl-collab dataset. (c) and (d) visualize the Cora dataset. (a) and (c) show the distributions of the number of common neighbors of the training edges and test edges. (b) and (d) show the performance of CN on the training set and test set.
  • Figure 5: Inference time and GPU memory on ogbl-collab. The measured process includes preprocessing and predicting one batch of test links. As shown in Appendix \ref{app:complexity}, the relation between time $y$ and batch size $t$ is $y=B+Ct$, where $B$ and $C$ are model-specific constants. SEAL has an out-of-memory problem and only uses small batch sizes.
  • ...and 3 more figures
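
The effect described in the Figure 4 caption can be reproduced on a small scale. The sketch below is an illustration under stated assumptions, not the paper's experiment: the edge lists are made up, and `build_adj` and `cn_count` are hypothetical helpers. It counts common neighbors of held-out edges once in the incomplete graph (training edges only) and once in the complete graph (training plus held-out edges).

```python
# Toy illustration of how unobserved edges reduce common-neighbor counts.

def build_adj(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def cn_count(adj, u, v):
    """Number of common neighbors of u and v in the given graph."""
    return len(adj.get(u, set()) & adj.get(v, set()))


# Made-up split: together the two lists form a complete graph on 4 nodes.
train_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
test_edges = [(1, 3), (0, 3)]

incomplete = build_adj(train_edges)              # what the model observes
complete = build_adj(train_edges + test_edges)   # the underlying graph

incomplete_counts = [cn_count(incomplete, u, v) for u, v in test_edges]
complete_counts = [cn_count(complete, u, v) for u, v in test_edges]

print("incomplete graph:", incomplete_counts)
print("complete graph:  ", complete_counts)
```

Each held-out edge loses a common neighbor when the other held-out edge is hidden, which is the kind of shift in common-neighbor statistics that the completion step is meant to counteract.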

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof