Message-Passing on Hypergraphs: Detectability, Phase Transitions and Higher-Order Information

Nicolò Ruggeri, Alessandro Lonardi, Caterina De Bacco

TL;DR

The paper tackles the problem of detectability limits for community structure in hypergraphs by introducing HySBM, a higher-order extension of the stochastic block model, and a scalable message-passing framework for inference. It derives closed-form detectability bounds that depend on hyperedge-size distributions, assortativity, and overlap with the clique expansion, and ties these bounds to entropy-based information measures. The authors also provide an exact sampling method for synthetic hypergraphs and an EM-based procedure to learn model parameters, validating the theory on synthetic data and real-world High School interaction data. Collectively, the work advances theoretical understanding and practical tools for analyzing systems with higher-order interactions.

Abstract

Hypergraphs are widely adopted tools to examine systems with higher-order interactions. Despite recent advancements in methods for community detection in these systems, we still lack a theoretical analysis of their detectability limits. Here, we derive closed-form bounds for community detection in hypergraphs. Using a Message-Passing formulation, we demonstrate that detectability depends on hypergraphs' structural properties, such as the distribution of hyperedge sizes or their assortativity. Our formulation enables a characterization of the entropy of a hypergraph in relation to that of its clique expansion, showing that community detection is enhanced when hyperedges highly overlap on pairs of nodes. We develop an efficient Message-Passing algorithm to learn communities and model parameters on large systems. Additionally, we devise an exact sampling routine to generate synthetic data from our probabilistic model. With these methods, we numerically investigate the boundaries of community detection in synthetic datasets, and extract communities from real systems. Our results extend the understanding of the limits of community detection in hypergraphs and introduce flexible mathematical tools to study systems with higher-order interactions.
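The clique expansion mentioned in the abstract replaces every hyperedge with a clique on its nodes. As a minimal sketch (not the authors' code; the function name and the toy hyperedges are hypothetical), the following counts, for each node pair, how many hyperedges contain it. Pairs with a count above one are the places where hyperedges overlap on pairs of nodes.

```python
from collections import Counter
from itertools import combinations


def clique_expansion(hyperedges):
    """Return a Counter mapping each node pair (i, j), i < j, to the
    number of hyperedges that contain both nodes."""
    weights = Counter()
    for e in hyperedges:
        # Deduplicate and sort so each pair is emitted once, as (small, large).
        for i, j in combinations(sorted(set(e)), 2):
            weights[(i, j)] += 1
    return weights


# Toy hypergraph (illustrative data only).
toy_hyperedges = [(0, 1, 2), (1, 2, 3), (0, 1)]
pair_weights = clique_expansion(toy_hyperedges)
overlapping_pairs = {pair for pair, w in pair_weights.items() if w > 1}
```

In this toy example the pairs `(0, 1)` and `(1, 2)` are each covered by two hyperedges, so they are the overlapping pairs.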

Paper Structure

This paper contains 37 sections, 4 theorems, 80 equations, 8 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

Assuming sparse hypergraphs where $c=O(1)$, the MP updates satisfy the following fixed-point equations to leading order in $N$. For all hyperedges $e \in E$ and nodes $i \in e$, the messages and marginals are given by: where $C' = \sum_{d=2}^D \binom{N-2}{d-2} \frac{1}{\kappa_d}$.
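The constant $C'$ can be evaluated numerically once the normalization $\kappa_d$ is specified. This section does not define $\kappa_d$, so the sketch below takes it as a user-supplied function. The default `kappa_assumed`, with $\kappa_d = \frac{d(d-1)}{2}\binom{N-2}{d-2}$, is an assumption made here for illustration only, not a definition taken from the text. Under that assumption the binomials cancel and the sum telescopes to $C' = 2(1 - 1/D)$, which gives a quick sanity check.

```python
from fractions import Fraction
from math import comb


def kappa_assumed(N, d):
    # ASSUMPTION (not stated in this section):
    # kappa_d = d(d-1)/2 * binom(N-2, d-2).
    return Fraction(d * (d - 1), 2) * comb(N - 2, d - 2)


def c_prime(N, D, kappa=kappa_assumed):
    """C' = sum_{d=2}^{D} binom(N-2, d-2) / kappa_d, in exact arithmetic."""
    total = Fraction(0)
    for d in range(2, D + 1):
        total += Fraction(comb(N - 2, d - 2)) / kappa(N, d)
    return total


# Illustrative values of N and D (hypothetical, chosen only for the check).
value = c_prime(N=100, D=5)
closed_form = Fraction(2) * (1 - Fraction(1, 5))
```

Exact rational arithmetic is used because the binomial coefficients grow quickly with $N$ and $d$; a different choice of $\kappa_d$ can be passed through the `kappa` argument.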

Figures (8)

  • Figure 1: Representing hypergraphs as factor graphs. (a) We depict a hypergraph and its factor graph equivalent. In the factor graph $\mathcal{F}$, function nodes represent hyperedges. Notice that, while the node sets are the same in both representations, due to the presence of all possible hyperedges in the paper's log-likelihood equation, the factor graph does not only contain the observed interactions $E$ (black), but also the unobserved ones $\Omega \setminus E$ (gray). (b) In factor graphs, there are two types of messages: variable-to-function node $q$ (red), and function-to-variable node $\hat{q}$ (blue).
  • Figure 2: Local tree assumption. (a) The classical local tree assumption for graphs. Here, it is assumed that the neighborhoods of nodes are approximately trees. (b) The tree assumption for factor graphs. Here, a path from a leaf (light blue) to a root (orange) consists of steps alternating variable nodes and function nodes. These two representations coincide in the case of graphs. (c) The perturbations propagate up the tree via the messages. In graphs (a), they reach the root passing from nodes $i_{r+1}$ to $i_{r}$ (green). In hypergraph-induced factor graphs, perturbations spread from a node $i_{r+1}$, at depth $r+1$, to its neighboring function nodes $f_{r+1}$ (red), and up to node $i_r$ at depth $r$ (blue) in an alternating fashion.
  • Figure 3: Phase transition. The overlap between ground truth and inferred communities varies for different $c_{\mathrm{out}}/c_{\mathrm{in}}$ ratios. The values attained are positive on the detectable region (left of the dotted theoretical bounds) and continuously drop to zero as the phase transition boundary approaches. Values for hyperedges up to size $D=50$ (orange) always yield higher overlap compared to $D=2$ (light blue). Shaded areas are standard deviations over $5$ random initializations of MP.
  • Figure 4: Theoretical phase transition. Due to the decomposition of our bound in the paper's equations for the bound factor and the $\gamma$ term, it is possible to separately describe the effects of $K$, $c$ and $D$ on the predicted phase transition. (a) Detectability bounds for networks $(D=2)$. Increasing $c$ yields a broader range of detectable configurations (colored areas) for $\rho_{\mathrm{in}}$. The number of communities skews detectability: while for $K=2$ communities can be detected in extremely disassortative regimes ($\rho_{\mathrm{in}}$ close to zero), when more communities are present, only assortative networks are detectable. (b) Effect of the maximum hyperedge size $D$. The term $\gamma(D)$ in the detectability bound can be split into the product $\gamma_1(D)\gamma_2(D)$, as defined in the paper's equations for $\gamma_1$ and $\gamma_2$. The non-trivial decrease of $\gamma(D)$ results from the interplay of $\gamma_1(D)$ and $\gamma_2(D)$, having opposite monotonicity. (c) The percentage decrease $\Delta \Phi(K,c,D) = (\Phi(K,c,D) - \Phi(K,c,2)) / \Phi(K,c,2)$ in detectability for different $c, D$ values shows that higher-order interactions steadily improve detection, especially in sparse regimes.
  • Figure 5: Experiments on the High School dataset. We infer the communities via MP and EM on the High School dataset. In all cases, we run inference with $K=10$ communities. (a) Inferred communities on the High School dataset, only utilizing hyperedges up to a maximum size $D$. Taking into account higher-order information, up to $D=4$, results in more granular partitions. (b) Graphical representation of the students' partition into classes. We draw only hyperedges of size $D$. (c) We compare the inferred partitions with the "attended class" covariate of the nodes, i.e., the classes students participate in. We comment further on this comparison in the appendix section on affinity on the High School dataset. (d) A quantitative measurement complementing that of panel (b): the Normalized Mutual Information (NMI) between inferred communities and attended classes, the AUC on the full dataset, as well as the ratio $\rho_D$ of hyperedges of size equal to $D$. (e) Free energy landscape. We consider the parameters $(p_2, n_2)$, $(p_3, n_3)$ and $(p_4, n_4)$ inferred from the dataset with, respectively, $D=2, 3, 4$. With these, we build the simplex of convex combinations $p = \sum_{i \in \{2,3,4\}} \lambda_i p_i$, where $\sum_{i \in \{2,3,4\}} \lambda_i = 1$ and $0 \leq \lambda_i \leq 1$ (similarly for $n$). For every point in the simplex, we compute the free energy on the full dataset, i.e., with $D=5$. More details on these computations are provided in the appendix section on the free energy on the High School dataset.
  • ...and 3 more figures
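The simplex of convex combinations described for Figure 5(e) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the grid resolution `steps`, and the placeholder arrays standing in for the inferred $p_2, p_3, p_4$ are all hypothetical, and the free energy evaluation itself is not shown.

```python
import numpy as np


def simplex_grid(steps):
    """Regular grid of weights (l2, l3, l4) with l_i >= 0 and sum 1.

    Each weight is a multiple of 1/steps, giving
    (steps + 1)(steps + 2) / 2 points.
    """
    points = []
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            points.append((a / steps, b / steps, (steps - a - b) / steps))
    return np.array(points)


def convex_combinations(params, lambdas):
    """Combine stacked parameters with simplex weights.

    params:  array of shape (3, ...), stacking the parameters for D=2,3,4.
    lambdas: array of shape (M, 3), one weight vector per row.
    Returns an array of shape (M, ...) with sum_i lambda_i * params[i].
    """
    return np.tensordot(lambdas, params, axes=(1, 0))


# Placeholder 2x2 arrays standing in for the inferred p_2, p_3, p_4
# (hypothetical values, not results from the paper).
p_stack = np.stack([np.eye(2) * s for s in (1.0, 2.0, 3.0)])
lambdas = simplex_grid(4)
p_grid = convex_combinations(p_stack, lambdas)
```

The same `convex_combinations` call applies to the $n$ parameters by stacking $n_2, n_3, n_4$ instead. Each row of `p_grid` would then be passed to a free energy routine evaluated on the full dataset.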

Theorems & Definitions (4)

  • Theorem 1
  • Proposition 1
  • Lemma 1
  • Lemma 2: Employing \ref{['th: sum and product lemma']}