Geodesic Length Distribution in Sparse Network Ensembles

Sahil Loomba, Nick S. Jones

TL;DR

This paper derives an analytic geodesic length distribution (GLD) for node pairs in sparse networks by establishing a recursive, probabilistic framework on sparse ensemble average networks (SEANs) and their generalization to sparse general random networks (SGRNs). It defines survival and conditional-PMF matrices, derives closed-form and approximate closed-form GLD expressions, and connects them to an integral-operator framework that yields spectral forms when the kernel is symmetric. The contributions span (i) a rigorous supercritical/subcritical treatment with percolation probabilities, (ii) a unifying closed-form GLD via matrix/operator exponentials and, for symmetric kernels, eigen-decompositions, and (iii) detailed model-specific instantiations for SBM, RDPG, Gaussian RGG, and sparse graphons with illustrative insights and empirical validation. The results enable analytic access to distances, centralities, and connectedness properties in very large, sparse networks and offer practical paths for inference and graph-learning tasks on partially observed data. Overall, the work provides a versatile, theory-grounded toolkit for geodesic statistics in diverse sparse network models with potential impact on network inference, coarsening, and representation learning.

Abstract

A key task in the study of networked systems is to derive local and global properties that impact connectivity, synchronizability, and robustness; computing shortest paths or geodesics yields measures of network connectivity that can explain such phenomena. We derive an analytic distribution of geodesic lengths on the giant component in the supercritical regime -- when the giant component exists -- or on small components in the subcritical regime, of any sparse (and possibly directed) network with conditionally independent edges, in the infinite-size limit. We provide specific results for widely used network models like stochastic block models, dot product graphs, random geometric graphs, and sparse graphons. The survival function of the geodesic length distribution possesses a simple closed-form expression which is asymptotically tight for finite lengths, has a natural interpretation of traversing independent geodesics in the network, and delivers novel insight into the aforementioned network families.
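
As a point of reference for the empirical side of such comparisons, the following is a minimal sketch of how a geodesic length distribution can be estimated from a single sampled network by breadth-first search. It is not the paper's analytic method. The function names (`sample_er_graph`, `bfs_lengths`, `empirical_gld_cdf`), the network size, and the seed are illustrative assumptions; only the standard library is used.

```python
import random
from collections import Counter, deque


def sample_er_graph(n, mean_degree, rng):
    """Sample an undirected Erdos-Renyi graph with edge probability mean_degree / (n - 1)."""
    p = mean_degree / (n - 1)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def bfs_lengths(adj, source):
    """Map every node reachable from `source` to its geodesic (shortest-path) length."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def empirical_gld_cdf(adj, source):
    """Fraction of the other n - 1 nodes lying within geodesic length l of `source`."""
    n = len(adj)
    dist = bfs_lengths(adj, source)
    counts = Counter(d for node, d in dist.items() if node != source)
    cdf, running = {}, 0
    for length in range(1, max(counts, default=0) + 1):
        running += counts.get(length, 0)
        cdf[length] = running / (n - 1)
    return cdf


# Illustrative parameters (assumptions, smaller than the n = 1024 used in the figures).
rng = random.Random(0)
adj = sample_er_graph(256, 2.0, rng)
# Pick a source on the largest component: the node that reaches the most nodes.
source = max(range(len(adj)), key=lambda s: len(bfs_lengths(adj, s)))
cdf = empirical_gld_cdf(adj, source)
```

Because unreachable targets never enter the count, the CDF saturates below one, at the fraction of other nodes that share a component with the source.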

Paper Structure

This paper contains 28 sections, 13 theorems, 193 equations, 19 figures, 1 table.

Key Result

Lemma 1

For nodes $i,j$ in a sparse ensemble average network $G$ that is undirected, if $i$ is a percolating node $(\mathbb{P}\left(\phi_i(G)\right)=\Omega\left(1\right))$ and $j$ is a percolating or non-percolating node $(\mathbb{P}\left(\phi_j(G)\right)=\Omega\left(1\right)$ or $0)$, then asymptotically:

Figures (19)

  • Figure 1: The analytic CDF of geodesic lengths for an ER graph agrees with the empirical CDF, where the source node is on the giant component. Network size is fixed at $n=1024$ and mean degree at $\left\langle d\right\rangle=2$. Solid and dotted lines indicate solutions derived from the analytic (Eqs. \ref{eq:spd_main}, \ref{eq:prob_connect_exact}, \ref{eq:gcc_consistency}) and approximate analytic forms (Eqs. \ref{eq:spd_main}, \ref{eq:prob_connect_apx}) respectively, while the dash-dotted line indicates the closed form obtained from Eq. \ref{eq:sf_avg}. Symbols ($\circ$) and bars indicate empirical estimates: mean and standard error over 10 network samples. The dashed asymptote indicates the size of the giant component as estimated from the self-consistent Eq. \ref{eq:gcc_consistency}. The approximate analytic form marginally underestimates the probability mass for shorter lengths, as is evident on the logarithmic scale (main plot). The closed form shows good agreement for shorter lengths but deviates strongly for longer ones (inset plot on the linear scale), saturating to unity for any percolating network. There is good agreement between the analytic and empirical estimates, with some deviation around the mode of the distribution due to finite-size effects; see Appendix \ref{sec:apdx_finite_size}.
  • Figure 2: Empirical, analytic, and closed-form CDF of geodesic lengths where the source node is on the giant component, for an ER graph with varying connectivity. Network size is fixed at $n=1024$, while mean degree varies as $\left\langle d\right\rangle\in\{1.25, 1.5, 2, 4, 8, 16\}$. Solid and dotted lines indicate analytic (Eqs. \ref{eq:spd_main}, \ref{eq:prob_connect_exact}, \ref{eq:gcc_consistency}) and closed-form solutions (Eqs. \ref{eq:prob_connect_exact}, \ref{eq:gcc_consistency}, \ref{eq:sf_avg}), respectively. Symbols ($\circ$) and bars indicate empirical estimates: mean and standard error over 10 network samples. The dashed asymptote indicates the size of the giant component as estimated from the self-consistent Eq. \ref{eq:gcc_consistency}. The analytic GLD is in good agreement for all connectivities at all lengths, while the closed-form GLD is in good agreement for all connectivities at shorter lengths (with deviation beginning around the mode).
  • Figure 3: The closed-form GLD can be decomposed over the eigenfunctions $\{\varphi_i(x)\}_{i=1}^N$ of the integral operator $T$ of the connectivity kernel $\nu(x,y)$ (Eqs. \ref{eq:spd_analytic_general_eig}, \ref{eq:spd_analytic_general_eig_uncorrected}), shown here for a 32-block SBM (over $V=[32]$) formed by discretizing a Gaussian RGG (over $V=\mathbb{R}$); see Fig. \ref{fig:spl_grgg_sbm_rank1} in Appendix \ref{sec:apdx_general} for model details. When sorted by decreasing eigenvalues $\tau_i$, eigenfunctions $\varphi_i$ of even-numbered $i$ do not contribute to the bound on the approximate closed form of the survival function of the GLD for an average node pair, in Eq. \ref{eq:spd_general_psi_agg_bound}.
  • Figure 4: Empirical, analytic, and approximate closed-form CDFs of geodesic lengths, where the source node is on the giant component, agree with each other for a bipartite SBM. The block matrix is $\mathbf{B}=\begin{pmatrix}0&8\\8&0\end{pmatrix}$, the distribution vector is $\boldsymbol\pi=(0.2, 0.8)$, and the network size is $n=1024$. Rows correspond to the block membership of the source node. The left column depicts the PMF, which highlights the bipartitivity of the network, and the right column depicts the CDF, whose tail value agrees with the percolation probability of the target node, indicated by the dashed asymptote and calculated from Eq. \ref{eq:gcc_consistency_sbm}. Solid lines represent the analytic form using Eqs. \ref{eq:gcc_consistency_sbm}--\ref{eq:spd_sbm_init}, while dotted lines represent the approximate closed form using Eq. \ref{eq:spd_sbm}, and symbols ($\circ$) with bars represent empirics, i.e. mean and standard error over 10 samples.
  • Figure 5: The empirical average geodesic lengths of a real-world email network $\mathbf{A}_\text{eue}$ (on the $x$-axis) are well-approximated by the average obtained from the analytic form of the GLD (on the $y$-axis, using Eqs. \ref{eq:gcc_consistency_sbm}, \ref{eq:spd_main_sbm}, \ref{eq:spd_sbm_init}), when nodes with the same "label" are grouped into a single block to form an SBM with $k$ blocks. Panels (a--d) correspond to different labelings: (top-left) $\mathbf{Z}_\text{dep}$ leverages the homophily assumption by using a node attribute as the label (here, the department of the e-mailer); (top-right) $\mathbf{Z}_\text{mod}$ uses modularity maximization \cite{clauset2004modmax, hagberg2008networkx} to infer network modules assigned as the label; (bottom) $\mathbf{Z}_\text{sbm2}, \mathbf{Z}_\text{sbm3}$ use a hierarchical SBM \cite{peixoto2014nestedsbm, peixoto2014graphtool} to infer a hierarchy of blocks which allows for multi-level coarsening. Points indicate the mean geodesic length between nodes of block pairs. To confirm that any deviations are only due to suboptimal SBM fits or having a single empirical network, insets (e--h) show the mean (black markers) and standard deviation (red bars) of geodesic lengths between block pairs, averaged over 10 samples of the corresponding SBM fit via coarsening. Both the mean and standard deviation of the empirical GLD are well approximated by the analytic GLD.
  • ...and 14 more figures
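
The dashed asymptote in Figures 1 and 2 is the giant-component size obtained from a self-consistent equation that this page does not reproduce. For an ER graph, the textbook self-consistency condition is $S = 1 - e^{-\left\langle d\right\rangle S}$. The sketch below solves that standard relation by fixed-point iteration, on the assumption that the paper's equation reduces to it in the ER case. The function name `er_giant_fraction` and its tolerance parameters are illustrative.

```python
import math


def er_giant_fraction(mean_degree, tol=1e-12, max_iter=10_000):
    """Solve S = 1 - exp(-mean_degree * S) by fixed-point iteration.

    Assumption: this is the standard Erdos-Renyi relation, not an equation
    quoted from the paper. Starting from S = 1 selects the non-trivial root
    whenever one exists (mean_degree > 1).
    """
    s = 1.0
    for _ in range(max_iter):
        s_next = 1.0 - math.exp(-mean_degree * s)
        if abs(s_next - s) < tol:
            return s_next
        s = s_next
    return s


# Mean degree 2, as in Figure 1: roughly 80% of nodes lie on the giant component.
fraction = er_giant_fraction(2.0)
```

Below the percolation threshold (mean degree at most one) the iteration decays to zero, which corresponds to the subcritical regime with no giant component.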

Theorems & Definitions (26)

  • Lemma 1: Connection probability on the giant component
  • proof
  • Corollary 1.1: Sparse connection probability of percolating nodes
  • proof
  • Lemma 2: Vanishing bridging probability
  • proof
  • Corollary 2.1: Vanishing probability of shared bridging nodes
  • proof
  • Lemma 3: First-order bridging probability
  • proof
  • ...and 16 more