Uncovering the overlapping community structure of complex networks in nature and society

Gergely Palla, Imre Derenyi, Illes Farkas, Tamas Vicsek

TL;DR

After defining a set of new characteristic quantities for the statistics of communities, this work applies an efficient technique for exploring overlapping communities on a large scale. It finds that overlaps are significant and that the distributions introduced reveal universal features of networks.

Abstract

Many complex systems in nature and society can be described in terms of networks capturing the intricate web of connections among the units they are made of. A key question is how to interpret the global organization of such networks as the coexistence of their structural subunits (communities) associated with more highly interconnected parts. Identifying these a priori unknown building blocks (such as functionally related proteins, industrial sectors and groups of people) is crucial to the understanding of the structural and functional properties of networks. The existing deterministic methods used for large networks find separated communities, whereas most of the actual networks are made of highly overlapping cohesive groups of nodes. Here we introduce an approach to analysing the main statistical features of the interwoven sets of overlapping communities that makes a step towards uncovering the modular structure of complex systems. After defining a set of new characteristic quantities for the statistics of communities, we apply an efficient technique for exploring overlapping communities on a large scale. We find that overlaps are significant, and the distributions we introduce reveal universal features of networks. Our studies of collaboration, word-association and protein interaction graphs show that the web of communities has non-trivial correlations and specific scaling properties.

Paper Structure

This paper contains 4 figures, 1 table.

Figures (4)

  • Figure 1: Illustration of the concept of overlapping communities. a) The black dot in the middle represents either of the authors of this Letter, with several of his communities around. Zooming into the scientific community demonstrates the nested and overlapping structure of the communities, while depicting the cascades of communities starting from some members exemplifies the interwoven structure of the network of communities. b) Divisive and agglomerative methods grossly fail to identify the communities when overlaps are significant. c) An example of overlapping $k$-clique-communities at $k=4$. The yellow community overlaps with the blue one in a single node, whereas it shares two nodes and a link with the green one. These overlapping regions are emphasised in red. Notice that any $k$-clique (complete subgraph of size $k$) can be reached only from the $k$-cliques of the same community through a series of adjacent $k$-cliques. Two $k$-cliques are adjacent if they share $k-1$ nodes.
  • Figure 2: The community structure around a particular node in three different networks. The communities are colour coded, the overlapping nodes and links between them are emphasised in red, and the volume of the balls and the width of the links are proportional to the total number of communities they belong to. For each network the value of $k$ has been set to 4. a) The communities of G. Parisi in the co-authorship network of the Los Alamos cond-mat archive (for threshold weight $w^*=0.75$) can be associated with his fields of interest. b) The communities of the word "bright" in the South Florida Free Association norms list (for $w^*=0.025$) represent the different meanings of this word. c) The communities of the protein ZDS1 in the DIP core list of the protein-protein interactions of S. cerevisiae can be associated with either protein complexes or certain functions.
  • Figure 3: Network of the 82 communities in the DIP core list of the protein-protein interactions of S. cerevisiae for $k=4$. The area of the circles and the width of the links are proportional to the size of the corresponding communities ($s^{\rm com}_\alpha$) and to the size of the overlaps ($s^{\rm ov}_{\alpha,\beta}$), respectively. The coloured communities are cut out and magnified to reveal their internal structure. In this magnified picture the nodes and links of the original network have the same colour as their communities, those that are shared by more than one community are emphasised in red, and the grey links are not part of these communities. The area of the circles and the width of the links are proportional to the total number of communities they belong to.
  • Figure 4: Statistics of the $k$-clique-communities for three large networks. These are the co-authorship network of the Los Alamos cond-mat archive (triangles, $k=6$, $f^*=0.93$), the word association network of the South Florida Free Association norms (squares, $k=4$, $f^*=0.67$), and the protein interaction network of the yeast S. cerevisiae from the DIP database (circles, $k=4$). (a) The cumulative distribution function of the community size follows a power law with exponents between $-1$ (upper line) and $-1.6$ (lower line). (b) The cumulative distribution of the community degree starts exponentially and then crosses over to a power law (with the same exponent as for the community size distribution). Plot (c) is the cumulative distribution of the overlap size and (d) is that of the membership number.
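
The $k$-clique-community definition in the Figure 1 caption (two $k$-cliques are adjacent if they share $k-1$ nodes, and a community is everything reachable through a series of adjacent $k$-cliques) can be illustrated with a minimal Python sketch. This is not the authors' efficient technique: it enumerates $k$-cliques by brute force and is only meant for toy graphs. The function names, the toy edge list, and the choice to return the four quantities of Figure 4 as plain lists and dicts are assumptions made for illustration. Community degree is taken here to mean the number of other communities a community overlaps with.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def k_cliques(adj, k):
    """Yield every k-clique once, as an ascending tuple (brute force)."""
    def extend(clique, candidates):
        if len(clique) == k:
            yield tuple(clique)
            return
        for v in sorted(candidates):
            # Only later-ordered common neighbours, so each clique appears once.
            yield from extend(clique + [v],
                              {w for w in candidates & adj[v] if w > v})
    yield from extend([], set(adj))


def k_clique_communities(edges, k):
    """Merge k-cliques that share k-1 nodes; return communities as node sets."""
    cliques = list(k_cliques(build_adjacency(edges), k))
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # (k-1)-node subset -> index of first clique containing it
    for idx, clique in enumerate(cliques):
        for face in combinations(clique, k - 1):
            if face in first_seen:
                parent[find(idx)] = find(first_seen[face])
            else:
                first_seen[face] = idx

    groups = defaultdict(set)
    for idx, clique in enumerate(cliques):
        groups[find(idx)].update(clique)
    return sorted(groups.values(), key=lambda c: sorted(c))


def community_statistics(communities):
    """Community sizes, pairwise overlap sizes, community degrees, memberships."""
    sizes = [len(c) for c in communities]
    overlaps = {}
    degrees = [0] * len(communities)
    for a, b in combinations(range(len(communities)), 2):
        shared = len(communities[a] & communities[b])
        if shared:
            overlaps[(a, b)] = shared
            degrees[a] += 1
            degrees[b] += 1
    membership = defaultdict(int)
    for community in communities:
        for node in community:
            membership[node] += 1
    return sizes, overlaps, degrees, dict(membership)


# Hypothetical toy graph: 4-cliques {0,1,2,3} and {1,2,3,4} share three nodes,
# while the 4-clique {4,5,6,7} touches them only through node 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (1, 4), (2, 4), (3, 4),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]

communities = k_clique_communities(edges, k=4)
sizes, overlaps, degrees, membership = community_statistics(communities)
```

On this toy graph the first two 4-cliques merge into one community of five nodes, and the third forms a separate community that overlaps with it in the single node 4, much like the yellow and blue communities of Figure 1c. For the real networks in Figure 4, each of the four returned quantities would be collected into a cumulative distribution; the brute-force enumeration used here would need to be replaced by something far more scalable.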