A framework to generate hypergraphs with community structure

Nicolò Ruggeri, Federico Battiston, Caterina De Bacco

TL;DR

A flexible and efficient framework for generating hypergraphs with many nodes and large hyperedges. It allows specifying general community structures and tuning different local statistics, and constitutes a substantial advancement in the statistical modeling of higher-order systems.

Abstract

In recent years, hypergraphs have emerged as a powerful tool to study systems with multi-body interactions that cannot be trivially reduced to pairs. While highly structured methods to generate synthetic data have proved fundamental for the standardized evaluation of algorithms and the statistical study of real-world networked data, these are scarcely available in the context of hypergraphs. Here we propose a flexible and efficient framework for the generation of hypergraphs with many nodes and large hyperedges, which allows specifying general community structures and tuning different local statistics. We illustrate how to use our model to sample synthetic data with desired features (assortative or disassortative communities, mixed or hard community assignments, etc.), analyze community detection algorithms, and generate hypergraphs structurally similar to real-world data. Overcoming previous limitations on the generation of synthetic hypergraphs, our work constitutes a substantial advancement in the statistical modeling of higher-order systems.

Paper Structure (35 sections, 2 theorems, 33 equations, 9 figures, 3 algorithms)

Key Result

Theorem 1

Consider $d_i, k_{\ell}$ as defined above. Furthermore, assume that $u$ is bounded, i.e., $\exists L > 0: u < L$, where the inequality holds element-wise. Then:

Figures (9)

  • Figure 1: Sampling hypergraphs with community structure. A pictorial representation of two small hypergraphs with $N=10$ nodes, $K=2$ communities, and (A) hard or (B) overlapping membership assignment. Every node's membership assignment $u_i = (u_{i1}, u_{i2})$ is represented as a pie chart. Single-colored nodes have hard assignments; mixed charts represent overlapping assignments. Due to the likelihood in \ref{eq: lambda}, nodes with overlapping assignments are more likely to take part in between-community interactions.
  • Figure 2: Sampling hypergraphs with hard and soft community assignment. (A) We sample hypergraphs from a model with $K=5$ equally-sized communities, an assortative affinity matrix $w$, and different node community memberships $u$ (from hard to soft). The five shaded yellow circles represent different communities; the thicknesses of the edges and circles are proportional to the interaction strength between and within communities. (B) The entropy of community memberships grows as increasingly overlapping configurations are considered. (C) We show the maximum assignment ratio (the relative number of nodes belonging to the majority class for each hyperedge) across hyperedge sizes. Orange circle sizes are proportional to the number of hyperedges with a given maximum assignment ratio.
  • Figure 3: Sampling hypergraphs with assortative and disassortative affinity and heterogeneous community size. (A) We sample hypergraphs with five communities of different sizes and hard membership assignments. We vary the affinity matrix $w$ from assortative (left, diagonal) to disassortative (right, uniform matrix filled with ones). Shaded yellow circles represent the communities; the thicknesses of the edges and circles are proportional to the interaction strength between and within communities. (B) We vary the affinity $w$ from diagonal (left) and increase its entries $w_{12}, w_{21}$ (right) for $K=3$ equally-sized communities. Nodes represent communities, and the thicknesses of the edges and circles are proportional to the strength of the interactions between and within communities.
  • Figure 4: Evaluating higher-order community detection algorithms. We sample hypergraphs to test the ability of different higher-order community detection algorithms to recover well-defined planted partitions. We consider hypergraphs with $N=500$ nodes, $K=3$ equally sized assortative communities and hard assignments. We plot the cosine similarity between the inferred partitions and the ground truth as a function of the maximum hyperedge size. Additional details on the data generation are given in \ref{sec supp: generation of synthetic benchmark}.
  • Figure 5: Computational complexity and scalability. We plot the computational cost of our sampling model for sparse hypergraphs as a function of the system size $N$. Our model is highly efficient: it samples sparse hypergraphs with up to $N=10^5$ nodes in less than one hour. We show results for hypergraphs with fixed expected degree equal to 5, both for an exact approach (solid line) and an approximate approach (dashed line) based on central limit theorem sampling of dyadic interactions. Here, we use $K=5$ communities and unconstrained maximum hyperedge size $D=N$.
  • ...and 4 more figures
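
The diagnostics named in the captions of Figures 2 and 4 (membership entropy, maximum assignment ratio, and cosine similarity against a planted partition) can be sketched as below. This is a minimal illustration, not the authors' code. The function names, the row-normalization of $u$, the averaging over nodes, and the maximization over community relabelings are all assumptions made here, and the paper's exact definitions may differ.

```python
from itertools import permutations

import numpy as np


def membership_entropy(u):
    """Mean Shannon entropy (in nats) of row-normalized memberships u (N x K).

    Assumption: each row of u is normalized to sum to one before the
    entropy is computed; the paper may normalize differently.
    """
    u = np.asarray(u, dtype=float)
    p = u / u.sum(axis=1, keepdims=True)
    logp = np.zeros_like(p)
    mask = p > 0
    logp[mask] = np.log(p[mask])  # 0 * log(0) is treated as 0
    return float(-(p * logp).sum(axis=1).mean())


def max_assignment_ratio(hyperedge, labels):
    """Fraction of a hyperedge's nodes that belong to its majority community.

    `hyperedge` is an iterable of node indices and `labels` holds one hard
    community label per node (hypothetical input layout).
    """
    nodes = list(hyperedge)
    counts = np.bincount(np.asarray(labels)[nodes])
    return counts.max() / len(nodes)


def cosine_similarity(u_true, u_pred):
    """Mean node-wise cosine similarity, maximized over column permutations.

    Assumption: community labels are only defined up to relabeling, so all
    K! permutations are tried. This is feasible only for small K.
    """
    def unit(x):
        x = np.asarray(x, dtype=float)
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.where(norms > 0, norms, 1.0)

    a, b = unit(u_true), unit(u_pred)
    k = a.shape[1]
    return max(
        float((a * b[:, list(perm)]).sum(axis=1).mean())
        for perm in permutations(range(k))
    )
```

For hard assignments every row of $u$ has a single nonzero entry, so the entropy is zero; it rises toward $\log K$ as memberships become uniformly mixed, which matches the qualitative trend described for panel (B) of Figure 2.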

Theorems & Definitions (4)

  • Definition
  • Theorem 1
  • proof
  • Theorem 2