Empirical Networks are Sparse: Enhancing Multi-Edge Models with Zero-Inflation

Giona Casiraghi, Georges Andres

TL;DR

The paper shows that empirical multi-edge networks are sparse and zero-inflated, a property that traditional Poisson-based models such as $G(N,p)$, SBM, and DCSBM do not capture. It introduces zero-inflated Poisson extensions (zi-$G(N,p)$, zi-SBM, zi-CLCM, zi-DCSBM) with EM-like parameter estimation, guided by moment matching and, where needed, solutions in terms of the Lambert $W$ function. Using the Sociopatterns datasets, the authors demonstrate that zi-DCSBM and related models reproduce the observed edge-count distributions, sparsity, heavy tails, and diffusion- and small-world-related metrics better than their non-zero-inflated counterparts. The work provides a more faithful framework for modeling real-world sparse networks and highlights avenues for further enhancement, including richer count distributions and block-inference methods specialized for zero-inflated structures.
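
To make the zero-inflated Poisson idea concrete, the following is a minimal sketch of a textbook EM fit for a single zero-inflated Poisson distribution over pair-wise edge counts, roughly the setting of a zi-$G(N,p)$. It is not the authors' estimator: the function names `fit_zip_em` and `sample_zip`, the parameter names `pi` (probability of a structural zero) and `lam` (Poisson rate), and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def fit_zip_em(counts, n_iter=500, tol=1e-10):
    """Fit a zero-inflated Poisson to a list of non-negative integer counts.

    Returns (pi, lam): pi is the probability of a structural zero,
    lam is the Poisson rate of the non-structural component.
    Textbook EM, used here only as an illustration.
    """
    n = len(counts)
    total = sum(counts)
    n_zero = sum(1 for c in counts if c == 0)
    if total == 0:
        return 1.0, 0.0  # degenerate case: every pair is disconnected

    # Initial guesses: half of the zeros are structural,
    # rate equal to the mean of the positive counts.
    pi = 0.5 * n_zero / n
    lam = total / (n - n_zero)

    for _ in range(n_iter):
        # E-step: probability that an observed zero is structural.
        p_zero = pi + (1.0 - pi) * math.exp(-lam)
        z = pi / p_zero if p_zero > 0 else 0.0
        structural = n_zero * z  # expected number of structural zeros

        # M-step: update the mixing weight and the Poisson rate.
        new_pi = structural / n
        new_lam = total / (n - structural)

        converged = abs(new_pi - pi) < tol and abs(new_lam - lam) < tol
        pi, lam = new_pi, new_lam
        if converged:
            break
    return pi, lam


def sample_zip(n, pi, lam, rng):
    """Draw n zero-inflated Poisson samples (Knuth's method, small lam only)."""
    limit = math.exp(-lam)
    out = []
    for _ in range(n):
        if rng.random() < pi:
            out.append(0)  # structural zero
            continue
        k, p = 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        out.append(k)
    return out


# Synthetic example: 80% structural zeros, rate 3 for the remaining pairs.
rng = random.Random(0)
counts = sample_zip(20000, 0.8, 3.0, rng)
pi_hat, lam_hat = fit_zip_em(counts)
print(f"pi = {pi_hat:.3f}, lam = {lam_hat:.3f}")
```

When the fitted `pi` is strictly positive, the fixed point of this iteration satisfies $\lambda / (1 - e^{-\lambda}) = s$, where $s$ is the mean of the positive counts. Equations of this form have the closed-form solution $\lambda = s + W_0(-s\,e^{-s})$, which gives a sense of where the Lambert $W$ function can enter such estimators.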

Abstract

Real-world networks are sparse. As we show in this article, even when a large number of interactions is observed, most node pairs remain disconnected. We demonstrate that classical multi-edge network models, such as the $G(N,p)$, configuration models, and stochastic block models, fail to accurately capture this phenomenon. To mitigate this issue, zero-inflation must be integrated into these traditional models. Through zero-inflation, we incorporate a mechanism that accounts for the excess number of zeroes (disconnected pairs) observed in empirical data. By performing an analysis on all the datasets from the Sociopatterns repository, we illustrate how zero-inflated models more accurately reflect the sparsity and heavy-tailed edge count distributions observed in empirical data. Our findings underscore that failing to account for these ubiquitous properties in real-world networks inadvertently leads to biased models that do not accurately represent complex systems and their dynamics.
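
The gap between model and data can be illustrated with a small calculation. Assume, as a simplification, that each of the $\binom{N}{2}$ node pairs receives an independent Poisson number of edges with mean $\rho = m/\binom{N}{2}$, where $m$ is the total number of observed interactions. The expected fraction of connected pairs is then $1 - e^{-\rho}$, which approaches 1 quickly as $m$ grows. In a zero-inflated variant with structural-zero probability `pi` (a hypothetical parameter name), the same fraction can never exceed $1 - \pi$. The sketch below compares the two; the function names and the example values are assumptions, not taken from the paper.

```python
import math


def poisson_connected_fraction(m, n_nodes):
    """Expected fraction of connected pairs when each pair has Poisson(rho) edges."""
    pairs = math.comb(n_nodes, 2)
    rho = m / pairs  # mean number of multi-edges per pair
    return 1.0 - math.exp(-rho)


def zip_connected_fraction(m, n_nodes, pi):
    """Same quantity for a zero-inflated Poisson with structural-zero probability pi.

    The Poisson rate is chosen so that the mean edge count per pair is still
    m / C(N, 2), i.e. (1 - pi) * lam = rho. Requires 0 <= pi < 1.
    """
    pairs = math.comb(n_nodes, 2)
    lam = m / (pairs * (1.0 - pi))
    return (1.0 - pi) * (1.0 - math.exp(-lam))


# Hypothetical network of 100 nodes with a growing number of interactions.
for m in (1_000, 10_000, 100_000):
    plain = poisson_connected_fraction(m, 100)
    inflated = zip_connected_fraction(m, 100, pi=0.8)
    print(f"m={m:>7}: Poisson {plain:.3f}, zero-inflated {inflated:.3f}")
```

Under these assumptions the plain Poisson model saturates towards a fully connected network, while the zero-inflated model stays bounded by $1 - \pi$ however many interactions accumulate.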

Paper Structure

This paper contains 21 sections, 31 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Empirical multi-edge networks are sparse. Traditional multi-edge models like the $G(N,p)$ struggle to reflect real-world data characteristics. (A) Edge count distribution in Zachary's Karate Club showing bimodality. The red solid line represents the $G(N,p)$ prediction, and the blue dashed line its zero-inflated counterpart. (B) Over time, edges accumulate between the same pairs of nodes in real-world networks. In grey, the interquartile range of the number of multi-edges per pair $\rho = m/\binom{N}{2}$ and the fraction of connected node pairs $d = M/\binom{N}{2}$, over all Sociopatterns datasets. The black lines denote the median values, while the dashed red line represents the expected fraction of connected pairs according to the $G(N,p)$ model with the corresponding $m$ value. Note that while the model quickly predicts a fully connected network, the empirical network remains sparse, indicating that most interactions occur among existing pairs rather than forming new connections.
  • Figure 2: (top) Edge count distributions for two exemplary Sociopatterns datasets (HS13 and KH). The grey bar plot shows the empirical edge count distribution. The height of a bar denotes the fraction of pairs in the network connected by a given range of multi-edges. In red, the expected edge count distribution according to a DCSBM whose blocks have been obtained by modularity maximisation. In blue, the expected edge count distribution according to its zero-inflated variant, fitted using the same blocks. (bottom) Cumulative error for the two models. In HS13, most of the difference between the two models can be attributed to the left side of the edge count distribution, i.e., to pairs with low edge counts. In KH, not only is the DCSBM unable to capture the network sparsity, but it also fails to capture the heavy-tailed nature of the edge count distribution. The zi-DCSBM provides a better fit in both cases.
  • Figure 3: Comparison of the DCSBM and zi-DCSBM fits for the KH dataset. On the left side, the network is visualised as a multi-graph in which the number of parallel edges encodes the multi-edge count on a $\log_{10}$ scale (i.e., 1 edge represents one interaction, 2 parallel edges represent 10 interactions, and so on). Nodes are coloured according to the labels inferred by modularity maximisation. The "lens" plots show a random realisation from the DCSBM (bottom) and zi-DCSBM (top). On the right, the adjacency matrices of the random realisations are visualised against the empirical network. These plots clearly highlight how the DCSBM fails to capture the sparsity of the empirical data.
  • Figure 4: The zi-DCSBM captures properties of empirical networks significantly better than its non-zero-inflated variant. (top) Percentage of the empirical average clustering coefficient captured by DCSBM (red) and zi-DCSBM (blue) for all the Sociopatterns datasets. (center) Percentage of the empirical average shortest path length captured by DCSBM (red) and zi-DCSBM (blue) for all the Sociopatterns datasets. (bottom) Percentage of the empirical spectral gap captured by DCSBM (red) and zi-DCSBM (blue) for all the Sociopatterns datasets. The expected values of the properties of each model have been computed from 1 000 realisations.