Graph sub-sampling for divide-and-conquer algorithms in large networks

Eric Yanchenko

TL;DR

The paper examines how the choice of graph sub-sampling routine affects divide-and-conquer analyses of large networks, focusing on two meso-scale tasks: community detection and core-periphery (CP) detection. It considers one divide-and-conquer framework per task (PACE for communities and a CP method), evaluates seven sub-sampling schemes, and derives theoretical results for the mis-classification rate of the CP divide-and-conquer algorithm under various schemes. Across extensive simulations and real-data analyses, random-node sampling yields the best performance for community detection, although the base algorithm applied to the full network is sometimes both more accurate and faster. For CP detection, samplers that favor core nodes (e.g., random edge and random walk) perform best, and the CP divide-and-conquer algorithm tends to outperform the base algorithm on the full graph in both accuracy and speed. The results indicate that the sub-sampling routine should be chosen with the specific task and dataset in mind.

Abstract

As networks continue to increase in size, current methods must be capable of handling large numbers of nodes and edges in order to be practically relevant. Instead of working directly with the entire (large) network, analyzing sub-networks has become a popular approach. Due to a network's inherent inter-connectedness, however, sub-sampling is not a trivial task. While this problem has gained popularity in recent years, it has not received sufficient attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we found that sampling nodes uniformly at random yields the best performance, but that sometimes the base algorithm applied to the entire network yields better results both in terms of identification and computational time. For CP structure, on the other hand, there was no single winner, but algorithms which sampled core nodes at a higher rate, e.g., random edge sampling and random walk sampling, consistently outperformed other sampling routines. Unlike community detection, the CP divide-and-conquer algorithm tends to yield better identification results while also being faster than the base algorithm. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.
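
To illustrate why random edge and random walk sampling tend to pick up core nodes at a higher rate than uniform node sampling, the sketch below compares three simple samplers on a toy core-periphery graph. It is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the toy graph, and the exact stopping rules are all hypothetical, and the paper's versions of these routines may differ in detail.

```python
import random


def toy_core_periphery(n_core=5, n_periphery=20):
    """Hypothetical toy graph: a fully connected core (nodes 0..n_core-1),
    with each periphery node attached to exactly one core node."""
    adj = {v: [] for v in range(n_core + n_periphery)}

    def add_edge(u, v):
        adj[u].append(v)
        adj[v].append(u)

    for u in range(n_core):
        for v in range(u + 1, n_core):
            add_edge(u, v)
    for p in range(n_core, n_core + n_periphery):
        add_edge(p, p % n_core)
    return adj


def random_node_sample(adj, m, rng):
    """Sample m distinct nodes uniformly at random."""
    return set(rng.sample(sorted(adj), m))


def random_edge_sample(adj, m, rng):
    """Draw edges uniformly at random and keep both endpoints until at
    least m nodes are collected (the result may contain m + 1 nodes)."""
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    nodes = set()
    while len(nodes) < m:
        nodes.update(rng.choice(edges))
    return nodes


def random_walk_sample(adj, m, rng):
    """Walk from a uniformly chosen start node, collecting visited nodes
    until m distinct nodes are seen. Assumes a connected graph, m <= n."""
    current = rng.choice(sorted(adj))
    nodes = {current}
    while len(nodes) < m:
        current = rng.choice(adj[current])
        nodes.add(current)
    return nodes


def mean_core_fraction(sampler, adj, m, n_core, reps, seed):
    """Average share of core nodes (labels below n_core) over repeated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = sampler(adj, m, rng)
        total += sum(v < n_core for v in sample) / len(sample)
    return total / reps


if __name__ == "__main__":
    graph = toy_core_periphery()
    for sampler in (random_node_sample, random_edge_sample, random_walk_sample):
        frac = mean_core_fraction(sampler, graph, m=8, n_core=5, reps=500, seed=1)
        print(f"{sampler.__name__}: mean core fraction = {frac:.2f}")
```

In this toy setting, uniform node sampling selects core nodes in proportion to their share of the graph, whereas drawing edges or following a walk reaches a node with probability tied to its degree, so the high-degree core is over-represented in the sample.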

Paper Structure (34 sections, 43 equations, 12 figures, 4 tables, 2 algorithms)

This paper contains 34 sections, 43 equations, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: Community detection simulation results with Fast Greedy base algorithm. The number in the upper-left corner corresponds to the simulation setting (1-6).
  • Figure 2: Community detection simulation run-time results with Fast Greedy base algorithm. The number in the upper-left corner corresponds to the simulation setting (1-6).
  • Figure 3: Core-periphery simulation accuracy results. The number in the upper-left corner corresponds to the simulation setting (7-12).
  • Figure 4: Core-periphery simulation run-time results. The number in the upper-left corner corresponds to the simulation setting (7-12).
  • Figure 5: Comparison of cores returned using different sub-sampling algorithms. The color of the square corresponds to the Jaccard coefficient between the two core sets, with lighter color meaning more similarity.
  • ...and 7 more figures
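
The Jaccard coefficient mentioned in the Figure 5 caption is the standard set-overlap measure |A ∩ B| / |A ∪ B|. The sketch below shows how it could be computed for two core sets; the function name and the node labels are hypothetical, and treating two empty sets as identical is an assumption made here, not something stated in the source.

```python
def jaccard(core_a, core_b):
    """Jaccard coefficient between two collections of node labels."""
    a, b = set(core_a), set(core_b)
    union = a | b
    if not union:
        # Assumed convention: two empty core sets count as identical.
        return 1.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical cores returned under two different sub-sampling routines.
    core_from_sampler_1 = {"n1", "n2", "n3"}
    core_from_sampler_2 = {"n2", "n3", "n4"}
    print(jaccard(core_from_sampler_1, core_from_sampler_2))
```

A value of 1 means the two routines returned the same core, and a value of 0 means the cores share no nodes.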