AEGIS: Authentic Edge Growth In Sparsity for Link Prediction in Edge-Sparse Bipartite Knowledge Graphs

Hugh Xuechen Liu, Kıvanç Tatar

TL;DR

This work addresses link prediction in highly edge-sparse bipartite knowledge graphs by introducing Authentic Edge Growth In Sparsity (AEGIS), an edge-only augmentation framework that duplicates observed training edges without creating new endpoints. The authors compare five policies, including simple and degree-aware authenticity-constrained resampling, random ER-like additions, perturbation-based synthesis, and semantic-KNN augmentation, under a high-rate bond-percolation sparsity stress test. Across Amazon, MovieLens, and a GDP domain case study, copy-based AEGIS variants provide strong sparsity baselines while semantic-KNN augmentation consistently improves AUC and calibration when textual or descriptive features are informative, especially in GDP. The results highlight a trade-off between maintaining topology and leveraging semantic signals, and suggest that text richness and evaluation metric choices (AUC vs Brier) determine the most effective augmentation strategy. The work points to practical, data-efficient strategies for sparse bipartite link prediction and outlines avenues for extending authenticity-constrained methods with adaptive, domain-aware semantics.

Abstract

Bipartite knowledge graphs in niche domains are typically data-poor and edge-sparse, which hinders link prediction. We introduce AEGIS (Authentic Edge Growth In Sparsity), an edge-only augmentation framework that resamples existing training edges, either uniformly (simple) or with inverse-degree bias (degree-aware), thereby preserving the original node set and sidestepping fabricated endpoints. To probe authenticity across regimes, we consider naturally sparse graphs (the game-pattern network of game design patterns, GDP) and induce sparsity in denser benchmarks (Amazon, MovieLens) via high-rate bond percolation. We evaluate augmentations on two complementary metrics, AUC-ROC (higher is better) and the Brier score (lower is better), using two-tailed paired t-tests against sparse baselines. On Amazon and MovieLens, copy-based AEGIS variants match the baseline, while semantic KNN augmentation is the only method that restores AUC and calibration; random and synthetic edges remain detrimental. On the text-rich GDP graph, semantic KNN achieves the largest AUC improvement and Brier score reduction, and the simple variant also lowers the Brier score relative to the sparse control. These findings position authenticity-constrained resampling as a data-efficient strategy for sparse bipartite link prediction, with semantic augmentation providing an additional boost when informative node descriptions are available.
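
The two steps the abstract names, sparsifying a graph by bond percolation and then growing the training set by duplicating observed edges, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `keep_prob` and `n_extra` parameters, and in particular the weighting used for the degree-aware variant (inverse of the summed endpoint degrees) are hypothetical choices, since the abstract only says that sampling is biased by inverse degree.

```python
import random
from collections import Counter


def bond_percolate(edges, keep_prob, rng):
    """Keep each edge independently with probability keep_prob.

    keep_prob is a hypothetical parameter name; a small value gives the
    highly edge-sparse regime used as a stress test.
    """
    return [e for e in edges if rng.random() < keep_prob]


def aegis_resample(train_edges, n_extra, rng, degree_aware=False):
    """Append n_extra duplicates of observed training edges.

    Only edges already in train_edges are sampled, so no new node and no
    new node pair is ever introduced.
    """
    if not train_edges or n_extra <= 0:
        return list(train_edges)

    weights = None  # uniform sampling: the "simple" variant
    if degree_aware:
        # Count degrees separately for the two sides of the bipartite graph.
        deg = Counter()
        for u, v in train_edges:
            deg[("left", u)] += 1
            deg[("right", v)] += 1
        # ASSUMPTION: weight each edge by the inverse of its summed endpoint
        # degrees, so edges touching low-degree nodes are copied more often.
        weights = [
            1.0 / (deg[("left", u)] + deg[("right", v)]) for u, v in train_edges
        ]

    extra = rng.choices(train_edges, weights=weights, k=n_extra)
    return list(train_edges) + extra
```

With a seeded `random.Random`, calling `bond_percolate` on a dense edge list and then `aegis_resample` on the result yields a multiset of edges whose distinct members are exactly the surviving training edges, which is the authenticity constraint described above.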

Paper Structure

This paper contains 92 sections, 1 equation, 39 figures, 9 tables, 5 algorithms.

Figures (39)

  • Figure 1: Amazon (product--category), GAT, $q{=}0.01$, $\phi{=}100$: Comprehensive degree analysis ($M\pm\mathrm{SD}$, $n=32$ seeds). Panel (a) shows degree distributions on log-log scale with $\pm 1\sigma$ confidence bands; (b) Power Law fits with exponent $\alpha$ (lower $\alpha$ = heavier tail); (c) Log-normal fits with parameters $\mu$ and $\sigma$; (d) Gini coefficients quantifying degree inequality ($0{=}$perfect equality, $1{=}$maximum inequality); (e) best-fit distribution counts (lower KS statistic wins); (f) summary statistics table.
  • Figure 2: MovieLens (movie--genre), GAT, $q{=}0.01$, $\phi{=}100$: Comprehensive degree analysis ($M\pm\mathrm{SD}$, $n=32$ seeds). Panel (a) shows degree distributions on log-log scale with $\pm 1\sigma$ confidence bands; (b) Power Law fits with exponent $\alpha$ (lower $\alpha$ = heavier tail); (c) Log-normal fits with parameters $\mu$ and $\sigma$; (d) Gini coefficients quantifying degree inequality ($0{=}$perfect equality, $1{=}$maximum inequality); (e) best-fit distribution counts (lower KS statistic wins); (f) summary statistics table.
  • Figure 3: GDP (game--pattern), GAT, $q{=}0.01$, $\phi{=}100$: Comprehensive degree analysis ($M\pm\mathrm{SD}$, $n=32$ seeds). Panel (a) shows degree distributions on log-log scale with $\pm 1\sigma$ confidence bands; (b) Power Law fits with exponent $\alpha$ (lower $\alpha$ = heavier tail); (c) Log-normal fits with parameters $\mu$ and $\sigma$; (d) Gini coefficients quantifying degree inequality ($0{=}$perfect equality, $1{=}$maximum inequality); (e) best-fit distribution counts (lower KS statistic wins); (f) summary statistics table.
  • Figure 4: Amazon (product--category), GAT, $q{=}0.01$, $\phi{=}100$: Comprehensive analysis ($M\pm\mathrm{SD}$, $n=32$ seeds) comparing baseline, augmentation methods, and original graph. Panel (a) shows degree distributions on log-log scale with confidence bands; (b) Power Law fits with exponent $\alpha$; (c) Log-normal fits with parameters $\mu$ and $\sigma$; (d) Gini coefficients quantifying degree inequality (lower = more uniform); (e) runtime comparison showing training time (left axis) and augmentation time (right axis, log scale); (f) best-fit distribution counts across methods.
  • Figure 5: Amazon (product--category), GAT, $q{=}0.01$, $\phi{=}5$: Comprehensive analysis ($M\pm\mathrm{SD}$, $n=32$ seeds) comparing baseline, augmentation methods, and original graph. Panel (a) shows degree distributions on log-log scale with confidence bands; (b) Power Law fits with exponent $\alpha$; (c) Log-normal fits with parameters $\mu$ and $\sigma$; (d) Gini coefficients quantifying degree inequality (lower = more uniform); (e) runtime comparison showing training time (left axis) and augmentation time (right axis, log scale); (f) best-fit distribution counts across methods.
  • ...and 34 more figures
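
Several captions above report Gini coefficients as a measure of degree inequality. As a point of reference, the standard Gini coefficient of a degree sequence can be computed as in the sketch below. The function name `gini` is illustrative, and whether the paper uses exactly this estimator is an assumption.

```python
def gini(values):
    """Gini coefficient of a non-negative sequence (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-based formula over ascending values x_1 <= ... <= x_n:
    # G = 2 * sum_i(i * x_i) / (n * sum_i x_i) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

For a finite sequence of length n, this estimator reaches at most (n - 1) / n, so it approaches the value 1 quoted in the captions as "maximum inequality" only as n grows.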