Local dominance unveils clusters in networks

Dingyi Shi, Fan Shang, Bingsheng Chen, Paul Expert, Linyuan Lü, H. Eugene Stanley, Renaud Lambiotte, Tim S. Evans, Ruiqi Li

TL;DR

A linear-time algorithm based on local information is proposed that identifies community centers and the associated hierarchical structure for effective community detection; it can also improve the clustering of vector data.

Abstract

Clusters or communities can provide a coarse-grained description of complex systems at multiple scales, but their detection remains challenging in practice. Community detection methods often define communities as dense subgraphs, or subgraphs with few connections in-between, via concepts such as the cut, conductance, or modularity. Here we consider another perspective built on the notion of local dominance, where low-degree nodes are assigned to the basin of influence of high-degree nodes, and design an efficient algorithm based on local information. Local dominance gives rise to community centers, and uncovers local hierarchies in the network. Community centers have a larger degree than their neighbors and are sufficiently distant from other centers. The strength of our framework is demonstrated on synthesized and empirical networks with ground-truth community labels. The notion of local dominance and the associated asymmetric relations between nodes are not restricted to community detection, and can be utilised in clustering problems, as we illustrate on networks derived from vector data.
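
The local-dominance idea in the abstract, as spelled out step by step in the Figure 1 caption, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `local_search_sketch`, the toy graph, and the node labels are hypothetical. Two simplifications are assumed: the BFS is unbounded rather than a "local" BFS, and a leader is linked only to a leader of strictly larger degree (the caption states $k_v \geq k_u$), which keeps equal-degree leaders from pointing at each other. Selecting community centers from the decision graph is left out, because the rescaling of $k_i$ and $l_i$ is defined in the Supplementary Note and not here.

```python
import random
from collections import deque


def local_search_sketch(adj, seed=0):
    """adj maps each node to the set of its neighbors (undirected, no self-loops)."""
    rng = random.Random(seed)
    deg = {u: len(nbrs) for u, nbrs in adj.items()}

    # Step 1: in traversal order, u temporarily points to every neighbor v of
    # maximal degree with k_v >= k_u, skipping neighbors that already point
    # to u (those are u's followers).
    out = {u: set() for u in adj}
    for u in sorted(adj):
        k_max = max((deg[v] for v in adj[u]), default=-1)
        if k_max >= deg[u]:
            out[u] = {v for v in adj[u] if deg[v] == k_max and u not in out[v]}

    # Nodes left without any out-going edge are the local leaders.
    leaders = {u for u in adj if not out[u]}

    # Step 2: every follower keeps one out-going edge chosen uniformly at
    # random; its path length l_u is 1.
    parent = {u: rng.choice(sorted(out[u])) for u in sorted(adj) if out[u]}
    length = {u: 1 for u in parent}

    # Step 3: BFS from each leader to its nearest leader of larger degree.
    # ASSUMPTION: strict inequality and an unbounded search (see lead-in).
    for u in sorted(leaders):
        dist = {u: 0}
        queue = deque([u])
        while queue:
            w = queue.popleft()
            if w in leaders and deg[w] > deg[u]:
                parent[u], length[u] = w, dist[w]
                break
            for z in sorted(adj[w]):
                if z not in dist:
                    dist[z] = dist[w] + 1
                    queue.append(z)
    return leaders, parent, length


# Hypothetical toy graph: two stars (hubs "a" and "b") joined through node "x".
edges = [("a", "a1"), ("a", "a2"), ("a", "a3"), ("a", "a4"), ("a", "x"),
         ("b", "b1"), ("b", "b2"), ("b", "b3"), ("b", "x")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

leaders, parent, length = local_search_sketch(adj)
print(sorted(leaders))                         # ['a', 'b']
print(parent["x"], parent["b"], length["b"])   # a a 2
```

In this toy graph both hubs are local leaders; hub "b" is attached to the higher-degree hub "a" with $l_b=2$, while "a" keeps no out-going link. Nodes with both a large degree and a large $l_i$ would then be the candidates for community centers.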

Paper Structure (16 sections, 6 figures, 4 tables)

This paper contains 16 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Schematic illustration of the Local Search (LS) algorithm. (A) An example network in which the digit on each node and the node size indicate its degree. (B) Identification of local leaders based on local dominance, which creates a forest of DAGs indicated by short-dashed directed edges. Each node $u$ points to any adjacent neighbor $v$ with $k_v \geq k_u$ and $k_v=\max\{k_z | z \in \mathbf{V}(u) \}$, where $\mathbf{V}(u)$ is the set of neighbors of $u$. In this example, nodes are traversed in lexicographical order. When node b is traversed, it points to m because $k_m=\max\{k_z | z \in \mathbf{V}(b) \} \geq k_b$; later, when m is traversed, it has no out-going link and is therefore identified as a local leader: it does not point to any of its followers, and its remaining neighbors all have smaller degrees. When more than one neighbor has the same largest degree, more than one directed edge is temporarily added; e.g., node c points to both b and m because $k_b=k_m=\max\{k_z | z \in \mathbf{V}(c) \} \geq k_c$, and nodes d and l also have more than one out-going link. The local leaders, which are potential community centers, are $f$, $m$, and $p$ (shown in dark grey). (C) Each node randomly retains just one out-going edge, shown as a short-dashed directed edge (e.g., c points to b or m with equal probability, and similarly for l and d). Then, for each local leader $u$, a local-BFS is performed to find its nearest local leader $v$ with $k_v\geq k_u$, and the shortest-path length $d_{uv}$ on the network is denoted by $l_u$. Here, $p\rightarrow f$ with $l_p=2$, and $f\rightarrow m$ with $l_f=4$. In (C), short-dashed arrows and long-dashed arrows correspond to pure followers (for which $l_u=1$) and local leaders (for which $l_u\geq 2$), respectively. Each node has at most one out-going link $(u\rightarrow v)$, which can reach beyond direct connections. The local leader(s) with the maximal degree have no out-going link (here, node m). (D) The corresponding tree structure formed by local dominance. The scale on the left is a visual aid for reading $l_i$ between connected nodes in the DAG. (E) Scatter plot of $k_i$ and $l_i$ for all nodes. Community centers have both a larger degree $k_i$ and a longer $l_i$. (F) The decision graph for quantitatively determining community centers (indicated by triangles) based on the product of the rescaled degree $\tilde{k}_i$ and the rescaled distance $\tilde{l}_i$ (see Supplementary Note 1.2 for details). Community centers can be detected by visual inspection for obvious gaps or by more sophisticated automatic detection methods. Here, two centers, nodes m and f, are identified. The color of nodes in (C) and (D) represents the community partition, and community centers are highlighted by a darker hue of the same color.
  • Figure 2: Community partitions by the LS and Louvain algorithms on synthesized networks with different degrees of heterogeneity. The heterogeneity increases from left to right. The color of nodes denotes the community membership. In a strictly homogeneous regular network ($N=36, \langle k\rangle=4$), all nodes are identical: (A) the LS algorithm detects only one community (see Supplementary Fig. 1 for more details); (D) by contrast, the Louvain algorithm detects five communities by optimizing modularity. In an Erdős-Rényi random network ($N=64, \langle k\rangle=4$), some communities may exist due to randomness reichardt2006networks: (B) the LS algorithm detects fewer communities than (E) the Louvain algorithm (see Supplementary Fig. 2). In a Ravasz-Barabási network ravasz2003hierarchical, which displays stronger heterogeneity, (C) the LS algorithm groups all first-level nodes and all sixteen second-level peripheral clusters into one community, and four small communities emerge (see Supplementary Fig. 4 for more details); (F) the Louvain algorithm partitions each second-level branch as a separate community and misclassifies a first-level peripheral cluster into its own community, a result of the traversal order and the modularity optimization process in the Louvain algorithm.
  • Figure 3: Detection of multiscale community structure with different heterogeneity. The network in (A) comprises four top-level communities (labeled a, b, c, and d) with 400 nodes each and an inter-connection probability $p_1=0.0002$, each of which further comprises four second-level communities with 100 nodes and $p_2=0.035$ (e.g., community c comprises c1, c2, c3, and c4). The second-level communities are generated by the Barabási-Albert model barabasi1999emergence with $m=7$, which leads to an average degree $\langle k\rangle =14$. (B) shows the decision graph for the LS method when analyzing the network in (A). (C) displays the tree structure formed by the local dominance between the identified centers of each community. For clarity, community centers are named by the community label instead of the real index of the node, and only the tree structure of these centers is shown. The height difference indicates the $l_i$ of the lower node. (D)-(F) are the same as (A)-(C), except that the second-level communities are generated as Erdős-Rényi random networks with a connection probability $p=0.14$, which leads to the same average degree $\langle k\rangle =14$. In this setting, similar to the SBM, nodes in the network are again relatively homogeneous. For clarity, in (E) and (F) only the top sixteen centers are labeled and their affiliation relations visualized; in total, LS detects 29 centers at the second level for this network. For the multiscale network in (A), the LS method detects four top-level communities with $F_1=0.99$ and 16 second-level communities with $F_1=0.56$. For the network in (D), the LS method detects four top-level communities with $F_1=0.89$ and 29 second-level communities with $F_1=0.29$. In both cases, the Louvain algorithm obtains only four communities, which correspond to the first-level ones, with $F_1=1$; however, it cannot detect the second-level partitions. Comparing the results in (A)-(C) with those in (D)-(F) shows that the LS algorithm works well on networks with stronger heterogeneity. The results shown here correspond to just one realization; across multiple realizations, because all first-level communities are equivalent to one another, as are all second-level communities, the label sequence in (B) and (E) and the tree structure in (C) and (F) may vary but retain a consistent structure.
  • Figure 4: The community structure detected by our LS algorithm on mobility flow networks in three diverse cities across continents. (A) Dakar in Senegal, Africa. (B) Abidjan in Côte d'Ivoire, Africa. (C) Beijing in China, Asia. Each dot represents a location, which corresponds to a region obtained by Voronoi tessellation of cellphone towers. Communities are indicated by different colors, and their centers are marked as stars. The decision graphs are shown in Supplementary Fig. 8.
  • Figure 5: Conversion from vector data to a network via the $\varepsilon$-ball method and the analogy between the community centers of networks and the cluster centers of vector data. (A) An example data cloud and (B) its discretised network representation by (Inset) the $\varepsilon$-ball method. (C) The decision graph by the density and distance based (DDB) algorithm rodriguez2014clustering. (D) The decision graph by the LS method. Cluster centers are data points that both have a higher density $\rho_i$ than their neighbors and are relatively far from other points with a larger density (i.e., a large $d_i$) rodriguez2014clustering. The density $\rho_i$ of a data point $i$ is simply the number of nodes within a certain radius $\varepsilon$, and it is equivalent to the degree of node $i$ in the corresponding network (i.e., $k_i=\rho_i$). The network construction is a coarse-graining and discretization process, in which the absolute distance value is not preserved (e.g., in the Inset, $d_{32}>d_{34}$ for the original vector data, but $l_{32}=l_{34}=1$ in the network). The Euclidean distance between any two data points is based on a global metric, whereas the topological path length between two nodes is based on a local metric. For example, $d_{24}$ is only slightly larger than $d_{34}$, but in the network, $l_{24}=2$ and $l_{23}=1$ (see the Inset); and although $d_{21}\approx 2d_{23}$ according to the global metric, node $2$ and node $1$ are not mutually reachable in the network based on the local metric. Cluster centers identified by the DDB algorithm match the community centers identified by the LS method, all of which are marked as stars.
  • ...and 1 more figures
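
The $\varepsilon$-ball construction described in the Figure 5 caption can be illustrated with a small Python sketch. It is not the authors' code. The function names `epsilon_ball_graph` and `hop_distance`, the coordinates, and the use of a non-strict threshold ($\leq \varepsilon$) are assumptions. The four made-up points are chosen only to mimic the relations stated for the Inset: $d_{32}>d_{34}$ with $l_{32}=l_{34}=1$, $d_{24}$ slightly larger than $d_{34}$ with $l_{24}=2$, and node 1 unreachable from node 2.

```python
import math
from collections import deque
from itertools import combinations


def epsilon_ball_graph(points, eps):
    """Link every pair of points whose Euclidean distance is at most eps.

    points maps a label to a coordinate tuple. ASSUMPTION: the threshold is
    non-strict (<= eps); the caption does not say which is used.
    """
    adj = {label: set() for label in points}
    for i, j in combinations(points, 2):
        if math.dist(points[i], points[j]) <= eps:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def hop_distance(adj, source, target):
    """Topological path length between two nodes, or None if unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        w = queue.popleft()
        if w == target:
            return dist[w]
        for z in adj[w]:
            if z not in dist:
                dist[z] = dist[w] + 1
                queue.append(z)
    return None


# Made-up coordinates, labeled 1 to 4 as in the Inset (not the paper's data).
points = {1: (-1.98, 0.0), 2: (0.0, 0.0), 3: (0.99, 0.0), 4: (0.6, 0.9)}
graph = epsilon_ball_graph(points, eps=1.0)

degree = {i: len(nbrs) for i, nbrs in graph.items()}  # k_i plays the role of rho_i
print(degree)                     # {1: 0, 2: 1, 3: 2, 4: 1}
print(hop_distance(graph, 2, 3))  # 1
print(hop_distance(graph, 2, 4))  # 2
print(hop_distance(graph, 2, 1))  # None
```

The resulting adjacency dictionary has the same shape as the input expected by a degree-based method such as LS, so the degree of each node can stand in for the density $\rho_i$ of the corresponding data point.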