Identifying hubs in directed networks

Alec Kirkley

TL;DR

A set of efficient nonparametric methods that classify hub nodes in directed networks using the Minimum Description Length principle is developed, providing a clear and principled definition for network hubs.

Abstract

Nodes in networks that exhibit high connectivity, also called "hubs", play a critical role in determining the structural and functional properties of networked systems. However, there is no clear definition of what constitutes a hub node in a network, and existing work has classified network hubs either purely qualitatively or by relying on ad hoc criteria for thresholding continuous data that do not generalize well to networks with certain degree sequences. Here we develop a set of efficient nonparametric methods that classify hub nodes in directed networks using the Minimum Description Length principle, effectively providing a clear and principled definition for network hubs. We adapt our methods to both unweighted and weighted networks and demonstrate them in a range of example applications using real and synthetic network data.

Paper Structure

This paper contains 15 sections, 21 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Diagram of hub-based encodings. (a) Schematic of the simple directed graph encoding described in Sec. \ref{sec:simple}, along with the description length of each step. (b) Schematic of the weighted directed graph/multigraph encoding described in Sec. \ref{sec:multigraph}, along with the description length of each step. We define the hub nodes of a network $G=(V,E)$ according to a given encoding (ERs, CMs, ERm, or CMm) as the node subset $V_h\subseteq V$ that minimizes the information required to transmit the network (e.g., the positions of the edges $E$) when transmitting the positions of edges incident to the hubs first. This provides a principled, nonparametric criterion for identifying hubs in directed networks based on the Minimum Description Length (MDL) principle.
  • Figure 2: Identifying hubs in networks with different in-degree distributions. (a) Fraction of nodes $h^\ast/N$ identified as hubs using the four methods detailed in Table \ref{tab:methods}, for Poisson-distributed weighted in-degrees. Experiments were performed over a broad range of average in-degree for $N=10^3$ (solid lines) and $N=10^5$ (dotted lines). (b) Inverse compression ratio $\eta$ (Eq. \ref{eq:etam}) for the ERm and CMm methods over the same set of experiments. The experiments were repeated for geometrically distributed in-degrees (panels (c) and (d)) and power-law (Zipf) distributed in-degrees (panels (e) and (f)), which exhibit progressively higher levels of relative variance. Error bars indicate two standard errors in the mean over 50 generated in-degree distributions, and large circles/squares around the data points in panel (b) indicate configurations for which the ERm model provided superior compression to the CMm model.
  • Figure 3: Identifying hub transitions in Price's model with different attachment exponents and seed sets. (a)-(d) The number of hubs $h^\ast$ identified by the four methods in Table \ref{tab:methods} is shown as a function of the number of time steps in a generalization of Price's network growth model (Eq. \ref{eq:price}) \cite{price1976general,krapivsky2000connectivity} for various attachment exponents $\alpha$ and numbers of seed nodes $m$ (dashed black lines). Error bars indicate two standard errors in the mean over 50 growth simulations with $T=100$ time steps. (e)-(f) Expected number of time steps until a single hub is detected (the "hub transition") over a range of attachment exponents and seed set sizes, for the ERs (Eq. \ref{eq:LERs}) and CMs (Eq. \ref{eq:LCMs}) hub identification objectives. Small white squares indicate the parameter values corresponding to panels (a)-(d).
  • Figure 4: Hub properties of real-world directed networks. (a) The fraction $h^\ast/N$ of nodes identified as hubs using the four methods in Table \ref{tab:methods}, for 82 real-world directed networks of various sizes collected from the Netzschleuder repository \cite{netzschleuder}. The median fraction of hubs found across all networks is shown with a dashed line for each method. See Appendix \ref{appendix:networks} for details on the networks studied. (b) Spearman rank correlation in the fraction of nodes identified as hubs across all networks in the corpus, for each pair of methods examined. (c) Fraction of nodes identified as hubs vs. the normalized degree entropy (Eq. \ref{eq:entropy}) for the four methods. Spearman correlations between $h^\ast /N$ and the normalized degree entropy values are reported in the legend, and the marker colors/styles correspond to the methods indicated in panel (a). (d) Inverse compression ratios (Eq. \ref{eq:etas} for simple graphs and Eq. \ref{eq:etam} for weighted graphs) across all networks when using the ER and CM encodings (x- and y-axes respectively). The points are scaled monotonically with the size $N$ of the network analyzed, and red (blue) markers indicate that the ER (CM) encoding was more compressive for the given network. The inset shows a zoomed-in view of the plot for $0.95\leq \eta\leq 1$, and the line of equality is shown as a dashed line for reference.
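
The hub criterion stated in the Figure 1 caption (choose the node subset that minimizes the bits needed to transmit the network when hub edges are sent first) can be illustrated with a minimal sketch. The code below is not the paper's ERs, CMs, ERm, or CMm objective, whose exact terms are not reproduced in this summary. It assumes a hypothetical two-part code for a simple directed graph: send the number of hubs, which nodes they are, how many edges point into them, and then the edge positions for the hub and non-hub blocks as uniformly chosen subsets. It also assumes hubs are the highest in-degree nodes. The names `log2_binom`, `toy_description_length`, and `find_hubs` are introduced here for illustration only.

```python
import math


def log2_binom(n, k):
    """Return log2 of the binomial coefficient C(n, k), using log-gamma."""
    if k < 0 or k > n:
        return float("inf")
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)


def toy_description_length(in_degrees, h):
    """Bits for a toy hub-first code with the h highest in-degree nodes as hubs.

    Assumption: this is an illustrative encoding, not the paper's objective.
    """
    n = len(in_degrees)
    ranked = sorted(in_degrees, reverse=True)
    e_total = sum(ranked)
    e_hub = sum(ranked[:h])                      # edges pointing into hubs
    bits = math.log2(n + 1)                      # h is an integer in [0, n]
    bits += log2_binom(n, h)                     # which nodes are hubs
    bits += math.log2(e_total + 1)               # number of edges into hubs
    bits += log2_binom(h * (n - 1), e_hub)       # positions of hub edges
    bits += log2_binom((n - h) * (n - 1), e_total - e_hub)  # remaining edges
    return bits


def find_hubs(in_degrees):
    """Return (h_star, bits) minimizing the toy description length over h."""
    n = len(in_degrees)
    h_star = min(range(n + 1),
                 key=lambda h: toy_description_length(in_degrees, h))
    return h_star, toy_description_length(in_degrees, h_star)


# Hypothetical input: 100 nodes, three with in-degree 60, the rest with 1.
degrees = [60, 60, 60] + [1] * 97
h_star, bits = find_hubs(degrees)
print(h_star, round(bits, 1))
```

Because the search only scans the number of top-ranked nodes, it needs one sort plus a pass over candidate values of h. Restricting hubs to the highest in-degree nodes is a simplifying assumption of this sketch.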