Finding community structure in very large networks

Aaron Clauset, M. E. J. Newman, Cristopher Moore

TL;DR

The paper addresses community detection in very large networks with a hierarchical agglomeration algorithm that greedily optimizes modularity. Efficient data structures for sparse networks give a running time of O(m d log n), where n is the number of vertices, m the number of edges, and d the depth of the dendrogram; on sparse, hierarchical networks this is essentially linear, O(n log^2 n). The authors demonstrate the approach on a large Amazon co-purchasing network, obtaining a maximum modularity of Q=0.745 at a partition into 1684 communities and uncovering meaningful, genre-like groupings as well as satellite and bridge communities. This work substantially extends the scalability of community detection to networks with millions of vertices and tens of millions of edges, enabling broader application and analysis of large-scale complex systems.

Abstract

The discovery and analysis of community structure in networks is a topic of considerable recent interest within the physics community, but most methods proposed so far are unsuitable for very large networks because of their computational cost. Here we present a hierarchical agglomeration algorithm for detecting community structure which is faster than many competing algorithms: its running time on a network with n vertices and m edges is O(m d log n) where d is the depth of the dendrogram describing the community structure. Many real-world networks are sparse and hierarchical, with m ~ n and d ~ log n, in which case our algorithm runs in essentially linear time, O(n log^2 n). As an example of the application of this algorithm we use it to analyze a network of items for sale on the web-site of a large online retailer, items in the network being linked if they are frequently purchased by the same buyer. The network has more than 400,000 vertices and 2 million edges. We show that our algorithm can extract meaningful communities from this network, revealing large-scale patterns present in the purchasing habits of customers.
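
To make the idea of greedy agglomeration concrete, here is a minimal Python sketch. It is not the authors' implementation: it rescans every connected pair of communities at each step, so it lacks the sparse data structures behind the O(m d log n) bound and only suits small graphs. The function name `greedy_modularity` and the variable names are hypothetical, and the sketch assumes an undirected simple graph (no self-loops) given as a list of edges with integer vertex labels. It uses the standard modularity Q = sum_i (e_ii - a_i^2), for which joining communities i and j changes Q by 2 (e_ij - a_i a_j).

```python
def greedy_modularity(edges):
    """Naive greedy modularity agglomeration (illustrative sketch only).

    edges: list of (u, v) pairs with integer labels, no self-loops (assumed).
    Returns (best_q, best_partition) over the whole merge sequence.
    """
    m = len(edges)
    e = {}  # e[i][j]: fraction of edge ends joining communities i and j (i != j)
    a = {}  # a[i]: fraction of all edge ends attached to community i
    for u, v in edges:
        for x, y in ((u, v), (v, u)):
            e.setdefault(x, {})
            e[x][y] = e[x].get(y, 0.0) + 1.0 / (2 * m)
            a[x] = a.get(x, 0.0) + 1.0 / (2 * m)

    # Start with every vertex in its own community: Q = -sum_i a_i^2.
    members = {v: {v} for v in a}
    q = -sum(x * x for x in a.values())
    best_q = q
    best_partition = [set(s) for s in members.values()]

    while True:
        # Scan all connected pairs for the largest change in Q.
        best = None
        for i, row in e.items():
            for j, eij in row.items():
                if i < j:
                    dq = 2.0 * (eij - a[i] * a[j])
                    if best is None or dq > best[0]:
                        best = (dq, i, j)
        if best is None:  # no connected pairs left to join
            break
        dq, i, j = best

        # Merge community j into community i.
        row_j = e.pop(j)
        del e[i][j]
        for k, ejk in row_j.items():
            if k == i:
                continue
            e[i][k] = e[i].get(k, 0.0) + ejk
            e[k][i] = e[k].get(i, 0.0) + ejk
            del e[k][j]
        a[i] += a.pop(j)
        members[i] |= members.pop(j)

        # Track the partition with the highest modularity seen so far.
        q += dq
        if q > best_q:
            best_q = q
            best_partition = [set(s) for s in members.values()]

    return best_q, best_partition
```

For example, on two triangles joined by a single edge, `greedy_modularity([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])` separates the two triangles, with Q = 5/14. The merging continues past the maximum until no connected pairs remain, which mirrors reading the best partition off the peak of the modularity curve rather than stopping early.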

Paper Structure

This paper contains 4 sections, 12 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: The modularity $Q$ over the course of the algorithm (the $x$ axis shows the number of joins). Its maximum value is $Q=0.745$, where the partition consists of $1684$ communities.
  • Figure 2: A visualization of the community structure at maximum modularity. Note that some major communities have a large number of "satellite" communities connected only to them (top, lower left, lower right). Also, some pairs of major communities have sets of smaller communities that act as "bridges" between them (e.g., between the lower left and lower right, near the center).
  • Figure 3: Cumulative distribution of the sizes of communities when the network is partitioned at the maximum modularity found by the algorithm. The distribution appears to follow a power law form over two decades in the central part of its range, although it deviates in the tail. As a guide to the eye, the straight line has slope $-1$, which corresponds to an exponent of $\alpha=2$ for the raw probability distribution.
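
The link between the slope of $-1$ and the exponent $\alpha=2$ in the Figure 3 caption can be checked with a short calculation. The notation here is an assumption for illustration: $p(s)$ denotes the raw probability distribution of community sizes $s$, treated as continuous, and $P(s)$ the cumulative distribution, i.e. the fraction of communities of size $s$ or larger.

```latex
p(s) \propto s^{-\alpha}, \quad \alpha > 1
\;\Longrightarrow\;
P(s) = \int_{s}^{\infty} p(s')\,\mathrm{d}s' \propto s^{-(\alpha-1)}
\;\Longrightarrow\;
\frac{\mathrm{d}\log P}{\mathrm{d}\log s} = -(\alpha - 1).
```

A slope of $-1$ on the log-log cumulative plot therefore gives $\alpha - 1 = 1$, that is, $\alpha = 2$.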