CORE: Data Augmentation for Link Prediction via Information Bottleneck

Kaiwen Dong; Zhichun Guo; Nitesh V. Chawla

TL;DR

CORE tackles noisy and incomplete graphs in link prediction (LP) with a two-stage data augmentation framework grounded in the Information Bottleneck principle. The Complete stage inflates the graph by adding high-probability edges; the Reduce stage then prunes edges under a Graph Information Bottleneck (GIB) objective, applied to the subgraph around each target link so that augmentations for different links do not interfere. The method optimizes variational bounds on a loss that balances predictive power against compression, supported by a theoretical guarantee under local-dependency assumptions. Empirically, CORE consistently improves Hits@50, strengthens heuristic link predictors, and increases robustness to adversarial perturbations across diverse datasets and backbones.
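The Complete stage described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `complete_stage` and `common_neighbors`, the parameter `k`, and the use of a common-neighbors count as the edge scorer are all assumptions made for the example. The summary only states that high-probability edges are added, so any link predictor could take the scorer's place.

```python
from itertools import combinations


def common_neighbors(adj, u, v):
    """Stand-in link scorer (assumption): number of shared neighbors."""
    return len(adj[u] & adj[v])


def complete_stage(edges, num_nodes, k, scorer=common_neighbors):
    """Return the inflated edge list and the list of newly added edges.

    Adds at most k non-edges with the highest scores, skipping
    candidates whose score is zero.
    """
    adj = {n: set() for n in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    existing = {tuple(sorted(e)) for e in edges}
    candidates = [
        (scorer(adj, u, v), (u, v))
        for u, v in combinations(range(num_nodes), 2)
        if (u, v) not in existing
    ]
    # Highest score first; ties are broken by node ids for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    added = [pair for score, pair in candidates[:k] if score > 0]
    return sorted(existing) + added, added


# Toy graph: nodes 0 and 3 share two neighbors (1 and 2) but are not linked.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
inflated, added = complete_stage(edges, num_nodes=5, k=1)
print(added)
```

The enumeration over all node pairs is quadratic and only suitable for toy graphs; a practical version would restrict candidates to pairs within a few hops.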

Abstract

Link prediction (LP) is a fundamental task in graph representation learning, with numerous applications in diverse domains. However, the generalizability of LP models is often compromised by noisy or spurious information in graphs and by the inherent incompleteness of graph data. To address these challenges, we draw inspiration from the Information Bottleneck principle and propose a novel data augmentation method, COmplete and REduce (CORE), to learn compact and predictive augmentations for LP models. In particular, CORE aims to recover missing edges in graphs while simultaneously removing noise from the graph structures, thereby enhancing the model's robustness and performance. Extensive experiments on multiple benchmark datasets demonstrate the applicability and superiority of CORE over state-of-the-art methods, showcasing its potential as a leading approach for robust LP in graph representation learning.
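For orientation, the Information Bottleneck trade-off can be written in its generic form using this page's notation from the Figure 2 caption, with $G^{+}_{(i,j)}$ the inflated subgraph, $G^{\pm}_{(i,j)}$ the pruned subgraph, and $Y$ the existence of the target link. The multiplier $\beta$ and the exact arrangement of terms are assumptions made here for illustration; the paper's Graph Information Bottleneck objective and its variational bounds may differ in detail.

```latex
\min_{G^{\pm}_{(i,j)}} \; -\, I\bigl(G^{\pm}_{(i,j)};\, Y\bigr) \;+\; \beta \, I\bigl(G^{\pm}_{(i,j)};\, G^{+}_{(i,j)}\bigr)
```

The first term rewards augmentations that stay predictive of the link; the second penalizes information retained from the inflated subgraph, which is what pushes the Reduce stage to prune edges.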

Paper Structure (68 sections, 1 theorem, 15 equations, 5 figures, 8 tables)

Introduction
Present work.
Preliminary
Graph and link prediction.
Subgraph link prediction.
Data augmentation.
Proposed framework: CORE
Complete stage: inflating missing connections
Implementation.
Reduce stage: pruning noisy edges
Interdependence of graph data.
GIB
Implementation of the Reduce stage.
Subgraph encoding.
Reduce by edge sampling.
...and 53 more sections

Key Result

Theorem 1

Assume that: (1) The existence $Y$ of a link $(i,j)$ is solely determined by its local neighborhood $G_{(i,j)}^{*}$ in a way such that $p(Y) = f(G_{(i,j)}^{*})$, where $f$ is a deterministic invertible function; (2) The inflated graph contains sufficient structures for prediction $G_{(i,j)}^{*} \in

Figures (5)

Figure 1: Overview of our CORE framework. It consists of two stages: (1) the Complete stage, which recovers missing edges by adding highly probable edges to the original graph, and (2) the Reduce stage, the core component of our method, which prunes noisy edges from the graph to prevent overfitting to both the intrinsic noise and the noise introduced by the Complete stage. Because predicting different links may require distinct augmentations, we extract the subgraph surrounding each link and augment it independently. In the illustrated social network example, assuming that Adams and Terry will become friends while Adams and Henry will not, tailored augmentations help the model predict each link more accurately.
Figure 2: The Reduce stage starts from the inflated subgraph $G^{+}_{(i,j)}$ surrounding the target link $(i, j)$. We first apply a GNN to encode node representations, then derive edge representations from the node encodings. To compute a sampling probability for each edge, we use an attention mechanism that combines the edge representation with the pooled subgraph representation. Because the pooled representation encapsulates information from the entire subgraph and is used for target link prediction, the resulting probabilities reflect not only each edge's inherent properties but also its relationship to the target link $(i, j)$. We then sample each edge from a Bernoulli distribution with its probability to obtain the pruned graph. Finally, the pruned graph $G^{\pm}_{(i,j)}$ is fed back into the model as augmented input with an enhanced graph structure.
Figure 3: CORE can enhance the graph structure and even boost heuristic link predictors (Hits@50).
Figure 4: Histogram of the standard deviations (std) of the learned edge mask $\omega$ for each edge across subgraphs associated with different target links. The frequent occurrence of large std values implies substantial disagreement about the optimal data augmentation when focusing on different target links.
Figure 5: CORE can improve LP performance in various hyperparameter settings measured by Hits@50. Warmer colors indicate improved performance over the baseline, whereas cooler colors signify the contrary.
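The edge-sampling step described in the Figure 2 caption can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's model: the toy embeddings stand in for GNN outputs, mean pooling stands in for the subgraph pooling, the edge representation is taken as the element-wise product of endpoint embeddings, and the attention score is a plain dot product followed by a sigmoid. The names `mean_pool`, `edge_keep_probs`, and `reduce_stage` are hypothetical.

```python
import math
import random


def mean_pool(node_emb):
    """Average all node embeddings into one subgraph vector (assumed pooling)."""
    dim = len(next(iter(node_emb.values())))
    n = len(node_emb)
    return [sum(vec[d] for vec in node_emb.values()) / n for d in range(dim)]


def edge_keep_probs(edges, node_emb):
    """Score each edge against the pooled subgraph vector."""
    pooled = mean_pool(node_emb)
    probs = {}
    for u, v in edges:
        # Edge representation: element-wise product of endpoint embeddings.
        edge_rep = [a * b for a, b in zip(node_emb[u], node_emb[v])]
        # Dot-product attention with the pooled vector, squashed to (0, 1).
        logit = sum(e * p for e, p in zip(edge_rep, pooled))
        probs[(u, v)] = 1.0 / (1.0 + math.exp(-logit))
    return probs


def reduce_stage(edges, node_emb, rng):
    """Keep each edge independently with its own Bernoulli probability."""
    probs = edge_keep_probs(edges, node_emb)
    kept = [e for e in edges if rng.random() < probs[e]]
    return kept, probs


# Toy embeddings standing in for GNN outputs on one inflated subgraph.
node_emb = {0: [1.0, 0.5], 1: [0.8, 0.4], 2: [-1.0, 0.2], 3: [0.9, 0.6]}
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
kept, probs = reduce_stage(edges, node_emb, random.Random(0))
print(kept)
```

The sketch covers only the forward sampling step for a single target link's subgraph. Training the scores end to end would additionally require a way to pass gradients through the discrete sampling, which is not shown here.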

Theorems & Definitions (2)

Theorem 1
proof
