CGLE: Class-label Graph Link Estimator for Link Prediction

Ankit Mazumder, Srikanta Bedathur

TL;DR

CGLE introduces class-label guidance into link prediction by constructing a class-conditioned link probability matrix and fusing it with backbone GNN embeddings via an MLP. The method uses either true class labels or pseudo-labels obtained through clustering to capture global priors on inter-class link formation, which improves performance on diverse graphs, including sparse and heterophilous networks. The framework extends NCN/NCNC with a two-phase pipeline: class priors are computed in preprocessing and combined with structural signals at prediction time, so the computational complexity of the backbone is unchanged. Experiments on homophilous and heterophilous benchmarks show substantial gains over NCN and NCNC, including HR@100 improvements of over 10 percentage points on Pubmed and DBLP, supporting the utility of semantic priors in link prediction and highlighting CGLE’s practicality and adaptability.

Abstract

Link prediction is a pivotal task in graph mining with wide-ranging applications in social networks, recommendation systems, and knowledge graph completion. However, many leading Graph Neural Network (GNN) models often neglect the valuable semantic information aggregated at the class level. To address this limitation, this paper introduces CGLE (Class-label Graph Link Estimator), a novel framework designed to augment GNN-based link prediction models. CGLE operates by constructing a class-conditioned link probability matrix, where each entry represents the probability of a link forming between two node classes. This matrix is derived from either available ground-truth labels or from pseudo-labels obtained through clustering. The resulting class-based prior is then concatenated with the structural link embedding from a backbone GNN, and the combined representation is processed by a Multi-Layer Perceptron (MLP) for the final prediction. Crucially, CGLE's logic is encapsulated in an efficient preprocessing stage, leaving the computational complexity of the underlying GNN model unaffected. We validate our approach through extensive experiments on a broad suite of benchmark datasets, covering both homophilous and sparse heterophilous graphs. The results show that CGLE yields substantial performance gains over strong baselines such as NCN and NCNC, with improvements in HR@100 of over 10 percentage points on homophilous datasets like Pubmed and DBLP. On sparse heterophilous graphs, CGLE delivers an MRR improvement of over 4% on the Chameleon dataset. Our work underscores the efficacy of integrating global, data-driven semantic priors, presenting a compelling alternative to the pursuit of increasingly complex model architectures. Code to reproduce our findings is available at: https://github.com/data-iitd/cgle-icdm2025.
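The class-conditioned matrix described in the abstract can be illustrated with a minimal sketch. The estimator here is an assumption, not the paper's confirmed formula: each entry is the number of observed edges between two classes divided by the number of possible node pairs between them. The function names `class_link_prior` and `pair_features` are hypothetical, edges are assumed undirected and listed once without self-loops, and the prior is assumed to enter the fused representation as a single scalar appended to the backbone's link embedding.

```python
from collections import Counter


def class_link_prior(edges, labels, num_classes):
    """Estimate prior[a][b]: the fraction of possible (class a, class b)
    node pairs that are connected by an observed edge.

    Assumption: this density-style estimator is illustrative only.
    """
    sizes = Counter(labels)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for u, v in edges:
        a, b = labels[u], labels[v]
        observed[a][b] += 1
        if a != b:
            observed[b][a] += 1  # keep the matrix symmetric

    prior = [[0.0] * num_classes for _ in range(num_classes)]
    for a in range(num_classes):
        for b in range(num_classes):
            if a == b:
                possible = sizes[a] * (sizes[a] - 1) // 2
            else:
                possible = sizes[a] * sizes[b]
            prior[a][b] = observed[a][b] / possible if possible else 0.0
    return prior


def pair_features(structural_embedding, prior, labels, u, v):
    """Append the class prior for the pair (u, v) to a backbone link embedding.

    Assumption: the prior is concatenated as one scalar feature.
    """
    return list(structural_embedding) + [prior[labels[u]][labels[v]]]


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 1]  # true labels or clustering pseudo-labels
    edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
    prior = class_link_prior(edges, labels, num_classes=2)
    print(prior)
    print(pair_features([0.5, -0.2], prior, labels, 0, 4))
```

Because the matrix depends only on the training edges and the labels, it can be computed once before training. The concatenated vector would then be passed to the MLP scorer in place of the backbone embedding alone.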

Paper Structure

This paper contains 25 sections, 2 theorems, 14 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $x, y \in V$ be a pair of nodes in an undirected graph $G = (V, E)$. A $\gamma$-decaying structural heuristic for link prediction between $x$ and $y$ is defined as: $$\mathcal{H}(x, y) = \eta \sum_{l=1}^{\infty} \gamma^{l} f(x, y, l),$$ where $\gamma \in (0,1)$ is a decay factor, $\eta > 0$ is a bounded scaling constant, and $f(x, y, l)$ encodes structural features (e.g., number of walks or path-based statistics) of length $l$ between nodes $x$ and $y$. If $f(x,

Figures (6)

  • Figure 1: In this illustration of class-label guided link prediction, node colors represent their class. The goal is to predict a link between the disconnected nodes V5 and V12. Standard methods like CN, which underpin top models, would fail to predict this link. CGLE, however, can predict the connection by incorporating the nodes' class information and identifying the significant global co-occurrence pattern between the yellow and orange classes.
  • Figure 2: The CGLE architecture for the task of link prediction, exemplified by the target link $V_5 \leftrightarrow V_{12}$. The underlying backbone model is NCN/NCNC, which incorporates subgraph extraction and, in the case of NCNC, a neighbor completion module. The broader CGLE framework is designed to be compatible with various GNN-based architectures for link prediction.
  • Figure 3: Elbow plots for determining the optimal number of clusters ($k$). The plots illustrate the Sum of Squared Distances (SSD) for varying $k$ on the Cora, Coauthor-Physics, and Roman-empire datasets.
  • Figure 4: Execution time (in seconds) for the CGLE(NCN) and CGLE(NCNC) models compared to the NCN and NCNC baselines. For brevity, this plot shows runtime on four selected datasets: Citeseer, DBLP, Pubmed and Roman-empire.
  • Figure 5: Link prediction performance of CGLE, using NCN and NCNC backbones, across 12 datasets for different numbers of k-means clusters ($k \in \{1, 2, 5, 10, 15\}$). The $k=1$ case serves as a baseline, equivalent to running the backbone models without class labels. The first seven datasets are homophilous, and the remaining five are heterophilous. Each subplot shows a specific performance metric (HR@100, MRR, or HR@10) for one dataset.
  • ...and 1 more figure
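The ranking metrics named in the Figure 5 caption (HR@K and MRR) can be sketched as follows. This is a hedged illustration, not the paper's evaluation code. It assumes HR@K is the fraction of positive links scored strictly above the K-th highest-scoring negative link, and that ties in MRR are resolved in favour of the positive link. The function names `hits_at_k` and `mean_reciprocal_rank` are hypothetical.

```python
def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positive links scored above the k-th best negative link."""
    if len(neg_scores) < k:
        return 1.0  # fewer than k negatives: every positive is within the top k
    threshold = sorted(neg_scores, reverse=True)[k - 1]
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


def mean_reciprocal_rank(pos_scores, neg_scores_per_pos):
    """Mean of 1/rank, where each positive is ranked against its own negatives.

    Assumption: rank = 1 + number of negatives scored strictly higher.
    """
    total = 0.0
    for pos, negs in zip(pos_scores, neg_scores_per_pos):
        rank = 1 + sum(n > pos for n in negs)
        total += 1.0 / rank
    return total / len(pos_scores)


if __name__ == "__main__":
    pos = [0.9, 0.4, 0.7]
    neg = [0.8, 0.6, 0.5, 0.1]
    print(hits_at_k(pos, neg, k=2))
    print(mean_reciprocal_rank(pos, [neg] * len(pos)))
```

Both metrics depend only on the relative ordering of scores, so they can be applied to the output of any backbone, with or without the class prior.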

Theorems & Definitions (2)

  • Theorem 1
  • Proposition 1