Link Prediction Accuracy on Real-World Networks Under Non-Uniform Missing Edge Patterns

Xie He, Amir Ghasemian, Eun Lee, Alice Schwarze, Aaron Clauset, Peter J. Mucha

TL;DR

This study employs 9 link prediction algorithms from 4 different families to analyze 20 different missing-edge patterns, categorized into 5 groups, and aims to provide a guide that helps future researchers select a link prediction algorithm well suited to their sampled network data.

Abstract

Real-world network datasets are typically obtained in ways that fail to capture all edges. The patterns of missing data are often non-uniform as they reflect biases and other shortcomings of different data collection methods. Nevertheless, uniform missing data is a common assumption made when no additional information is available about the underlying missing-edge pattern, and link prediction methods are frequently tested against uniformly missing edges. To investigate the impact of different missing-edge patterns on link prediction accuracy, we employ 9 link prediction algorithms from 4 different families to analyze 20 different missing-edge patterns that we categorize into 5 groups. Our comparative simulation study, spanning 250 real-world network datasets from 6 different domains, provides a detailed picture of the significant variations in the performance of different link prediction algorithms in these different settings. With this study, we aim to provide a guide for future researchers to help them select a link prediction algorithm that is well suited to their sampled network data, considering the data collection process and application domain.
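
To make the evaluation setup concrete, the following is a minimal sketch of how a link predictor can be scored under a uniform versus a non-uniform missing-edge pattern. Everything in it is an assumption chosen for illustration: the random toy graph, the degree-biased removal rule, the common-neighbors score, and all function names are hypothetical stand-ins, not the paper's 20 sampling methods or 9 algorithms. AUC is computed here as the probability that a hidden edge outscores a true non-edge, with ties counted as one half.

```python
import random
from itertools import combinations

def toy_graph(n, p, rng):
    """Random toy graph as a set of (u, v) tuples with u < v (stand-in for a real network)."""
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}

def hide_uniform(edges, frac, rng):
    """Uniform pattern: every edge is equally likely to go missing."""
    held = set(rng.sample(sorted(edges), int(frac * len(edges))))
    return edges - held, held

def hide_degree_biased(edges, frac, rng):
    """Hypothetical non-uniform pattern: edges between high-degree nodes go missing more often."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    pool = sorted(edges)
    weights = [deg[u] * deg[v] for u, v in pool]
    target = int(frac * len(pool))
    held = set()
    while len(held) < target:  # weighted draws, duplicates ignored by the set
        held.add(rng.choices(pool, weights=weights, k=1)[0])
    return edges - held, held

def common_neighbors(observed, n):
    """Return a scoring function: number of shared neighbors in the observed graph."""
    nbrs = [set() for _ in range(n)]
    for u, v in observed:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return lambda u, v: len(nbrs[u] & nbrs[v])

def auc(score, positives, negatives):
    """Fraction of (hidden edge, non-edge) pairs ranked correctly; ties count 0.5."""
    pos = [score(u, v) for u, v in positives]
    neg = [score(u, v) for u, v in negatives]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def evaluate(edges, n, hide, frac, rng):
    """Hide a fraction of edges, score all pairs on what remains, and return the AUC."""
    observed, held = hide(edges, frac, rng)
    non_edges = [e for e in combinations(range(n), 2) if e not in edges]
    return auc(common_neighbors(observed, n), sorted(held), non_edges)

rng = random.Random(0)
n = 60
edges = toy_graph(n, 0.15, rng)
for name, hide in [("uniform", hide_uniform), ("degree-biased", hide_degree_biased)]:
    runs = [evaluate(edges, n, hide, 0.2, rng) for _ in range(5)]  # mean over 5 runs
    print(name, round(sum(runs) / len(runs), 3))
```

On a structureless random graph like this one, no large gap between the two patterns should be expected; the sketch only shows where the missing-edge pattern enters the pipeline, namely in the choice of the `hide` function.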

Paper Structure (1 section, 8 figures, 6 tables)

Table of Contents

  1. Supporting Information

Figures (8)

  • Figure 1: AUC scores over 5 runs on each network for 9 link prediction algorithms on samples obtained by 20 methods. The 250 different networks are grouped into 6 domains (arranged vertically). Symbols indicate mean AUCs, with standard deviations shown by vertical bars. The sampling methods are listed along the bottom of the figure. The prediction methods are marked with different colors, as indicated in the legend at the top.
  • Figure 2: Box plots of AUCs from different link prediction methods for different families of missingness patterns, grouped by network domain.
  • Figure 3: PCA scores (PC1 horizontal, PC2 vertical) of different sampling methods under different prediction algorithms and different dataset domains. Each panel considers a single link prediction method within a single dataset domain, taking as features the full set of AUC scores (averaged over 5 runs) of that prediction method across the networks in that domain for each sampling method, marked with different colors and symbols as indicated in the legend.
  • Figure 4: AUCs for Edge-Based Missingness Patterns from different link prediction methods, grouped by network domain.
  • Figure 5: AUCs for Node-Based Missingness Patterns from different link prediction methods, grouped by network domain.
  • ...and 3 more figures
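
The PCA view described in the Figure 3 caption can be sketched as follows: each sampling method is one row, each column holds that method's AUC on one network, and the rows are projected onto the first two principal components. This is a hedged illustration only. The AUC values below are random placeholders, not results from the paper, and the function names are hypothetical. A plain power iteration is used to keep the sketch dependency-free; a library SVD or PCA routine would normally be preferred.

```python
import random

def center(rows):
    """Subtract each column mean so that PCA acts on deviations from the average."""
    m, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(X):
    """Sample covariance matrix (d x d) of already-centered rows."""
    m, d = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(m)) / (m - 1) for b in range(d)]
            for a in range(d)]

def project_out(v, basis):
    """Remove from v its components along each (unit-length) vector in basis."""
    for b in basis:
        dot = sum(x * y for x, y in zip(v, b))
        v = [x - dot * y for x, y in zip(v, b)]
    return v

def normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def next_component(C, previous, rng, iters=200, tol=1e-12):
    """Power iteration restricted to the subspace orthogonal to earlier components."""
    d = len(C)
    v = normalize(project_out([rng.random() + 0.1 for _ in range(d)], previous))
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        w = project_out(w, previous)
        if sum(x * x for x in w) ** 0.5 < tol:  # no variance left in this subspace
            break
        v = normalize(w)
    return v

def pca_scores(rows, k=2, seed=0):
    """Return, for each row, its coordinates on the first k principal components."""
    rng = random.Random(seed)
    X = center(rows)
    C = covariance(X)
    components = []
    for _ in range(k):
        components.append(next_component(C, components, rng))
    return [[sum(x[j] * v[j] for j in range(len(v))) for v in components] for x in X]

# Placeholder input: 20 sampling methods (rows) x 8 networks (columns).
# These numbers are random stand-ins for run-averaged AUCs, NOT data from the paper.
rng = random.Random(42)
toy_aucs = [[0.5 + 0.5 * rng.random() for _ in range(8)] for _ in range(20)]
toy_scores = pca_scores(toy_aucs)
for method_index, (pc1, pc2) in enumerate(toy_scores[:3]):
    print(method_index, round(pc1, 3), round(pc2, 3))
```

In such a plot, sampling methods that land close together affect the given predictor similarly across the networks of a domain. The sign of each component is arbitrary, so only relative positions are meaningful.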