Parallel Algorithms for Median Consensus Clustering in Complex Networks

Md Taufique Hussain, Mahantesh Halappanavar, Samrat Chatterjee, Filippo Radicchi, Santo Fortunato, Ariful Azad

TL;DR

This work addresses the challenge of deriving a single, representative clustering of a graph from multiple input partitions by optimizing a median-consensus objective based on Mirkin distance. It introduces a graph-aware greedy algorithm that moves vertices along graph edges to minimize total disagreement, eliminating the need for quadratic memory and enabling parallel execution. A preprocessing step groups homogeneous partitions, after which the consensus is computed per group, with a parallel OpenMP implementation achieving substantial speedups on large-scale graphs. Empirical results on synthetic LFR benchmarks and real networks show improved accuracy over baselines and strong scalability, including speedups on up to 64 cores and effective handling of graphs with hundreds of thousands of nodes.

Abstract

We develop an algorithm that finds the consensus of many different clustering solutions of a graph. We formulate the problem as a median set partitioning problem and propose a greedy optimization technique. Unlike other approaches that find median set partitions, our algorithm takes graph structure into account and finds a comparable quality solution much faster than the other approaches. For graphs with known communities, our consensus partition captures the actual community structure more accurately than alternative approaches. To make it applicable to large graphs, we remove sequential dependencies from our algorithm and design a parallel algorithm. Our parallel algorithm achieves 35x speedup when utilizing 64 processing cores for large real-world graphs from single-cell experiments.
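
To make the median set partitioning objective concrete, the following is a minimal brute-force sketch, assuming the Mirkin distance counts unordered vertex pairs that are grouped together in one partition and separated in the other. The function names `mirkin_distance` and `median_objective` and the toy partitions are hypothetical and chosen for illustration. The sketch enumerates all vertex pairs, which is exactly the quadratic cost the paper's graph-aware algorithm is designed to avoid, so it is useful only for checking small examples.

```python
from itertools import combinations


def mirkin_distance(p, q):
    """Count unordered vertex pairs on which two partitions disagree.

    p and q are lists of cluster labels, one label per vertex.
    A pair disagrees if it is co-clustered in exactly one of p and q.
    """
    if len(p) != len(q):
        raise ValueError("partitions must cover the same vertices")
    disagreements = 0
    for u, v in combinations(range(len(p)), 2):
        together_in_p = p[u] == p[v]
        together_in_q = q[u] == q[v]
        if together_in_p != together_in_q:
            disagreements += 1
    return disagreements


def median_objective(candidate, partitions):
    """Total disagreement between a candidate and all input partitions."""
    return sum(mirkin_distance(candidate, p) for p in partitions)


# Hypothetical toy input: three partitions of five vertices.
partitions = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 2],
]

# Score each input partition as a candidate consensus.
scores = [median_objective(p, partitions) for p in partitions]
best_index = min(range(len(partitions)), key=scores.__getitem__)
```

A median partition is any partition of the vertices that minimizes `median_objective`. Restricting the search to the inputs themselves, as in the last two lines, is only a simple baseline and not the greedy procedure described in the paper.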

Paper Structure (19 sections, 10 equations, 10 figures)

Figures (10)

  • Figure 1: Four partitions $\textbf{P}_1, \textbf{P}_2, \textbf{P}_3,$ and $\textbf{P}_4$ of a graph with 12 vertices. Within each partition, vertices belonging to different clusters are shown in different colors. The cluster membership vector of each vertex is illustrated in the right figure. Different colors indicate the vertex's membership across the various clusters of the four input partitions. The distance between any pair of vertices is the Hamming distance between their vectors. For example, $\delta_{v_0v_1}=1$, $\delta_{v_0v_4}=0$, and $\delta_{v_3v_9}=2$.
  • Figure 2: A consensus clustering $\textbf{C}=\{c_1, c_2, c_3, c_4\}$ of the graph shown in Fig. \ref{fig:input}. We show the computation to identify the best move for $v_3$ using Eq. \ref{eq:objective3}. Since $c_4$ does not contain a neighbor of $v_3$, it is not considered in this computation. Based on this calculation, $v_3$ moves to the cluster $c_2$ because this move reduces the total distance by 4.
  • Figure 3: Effect of outlier removal parameter when creating a consensus of (a) widely different partitions and (b) slightly different partitions. LFR benchmark networks with n=5000 are used for this experiment. For (a), 38 different partitions were involved; for (b), 16 different partitions were involved.
  • Figure 4: Accuracy of different consensus partitions for LFR benchmark graphs with n=5000, when slightly different input partitions are considered. Here input partitions are obtained by different runs of the Louvain algorithm. Since the ground truth communities are known for LFR benchmark graphs, partition accuracy is measured by the distance from the ground truth community (where lower distance values indicate better accuracy). The black line represents the accuracy of the input partitions used to build the consensus, while the colored lines represent various consensus methods. Points on each line represent the average from 10 sets of experiments, with the corresponding error bars indicating the maximum and minimum values observed across all 10 sets.
  • Figure 5: Accuracy of different partitions of LFR benchmark graphs with n=5000 when widely different partitions are considered. Here input partitions are obtained by different parameters of 14 different clustering algorithms. The black line represents the accuracy of the input partitions used to build the consensus, while the colored lines represent different consensus methods. Points on each line represent the average from 10 sets of experiments, with the corresponding error bars indicating the maximum and minimum values observed across all 10 sets. In all cases, a lower value on the Y-axis is better because it indicates a smaller distance to the ground truth.
  • ...and 5 more figures
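
The vertex-pair distance described in the Figure 1 caption can be sketched as follows. This is an illustrative example under stated assumptions: the helper names `membership_vectors` and `vertex_distance` are hypothetical, and the toy partitions are not the ones drawn in the figure. Each vertex gets one cluster label per input partition, and the distance between two vertices is the Hamming distance between those label vectors, that is, the number of input partitions that place the two vertices in different clusters.

```python
def membership_vectors(partitions):
    """Build one cluster-membership vector per vertex.

    partitions is a list of label lists, one label per vertex.
    Entry i of a vertex's vector is its cluster label in partition i.
    """
    n = len(partitions[0])
    if any(len(p) != n for p in partitions):
        raise ValueError("partitions must cover the same vertices")
    return [tuple(p[v] for p in partitions) for v in range(n)]


def vertex_distance(vec_u, vec_v):
    """Hamming distance: number of partitions separating the two vertices."""
    return sum(a != b for a, b in zip(vec_u, vec_v))


# Hypothetical toy input: three partitions of four vertices.
partitions = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]

vectors = membership_vectors(partitions)
d01 = vertex_distance(vectors[0], vectors[1])  # separated in one partition
d03 = vertex_distance(vectors[0], vectors[3])  # separated in all three
```

Comparing labels position by position is valid because both vectors take their i-th entry from the same partition, so only label equality within a partition matters, not the label values themselves.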