Parallel Motif-Based Community Detection
Tianyi Chen, Charalampos E. Tsourakakis
TL;DR
The paper tackles scalable community detection by evaluating motif-based methods and introducing a parallel shared-memory framework. It proposes Triangle-Wedges (TW) as a new edge-similarity score and demonstrates that motif-based approaches can achieve favorable quality-efficiency tradeoffs, while addressing threshold selection and biases in prior ground-truth evaluations. Theoretical results show that TW can recover communities in stochastic block model (SBM) settings where Tectonic may fail, and empirical results on real and synthetic graphs validate practical gains in speed and memory efficiency. Overall, the work delivers a practical, scalable toolkit for motif-based clustering and provides guidance on threshold choice for real-world deployment.
Abstract
Community detection is a central task in graph analytics. As graphs continue to grow in size, scalable community detection remains an unresolved challenge. Recently, alongside established methods such as Louvain and Infomap, motif-based community detection has emerged. Techniques such as Tectonic identify communities by pruning edges based on motif similarity scores and taking the connected components of the resulting graph. In this study, we perform a comprehensive evaluation of community detection methods, focusing on both the quality of their output and their scalability. Specifically, we contribute an open-source parallel framework for motif-based community detection based on a shared-memory architecture. We conduct a thorough comparative analysis of state-of-the-art community detection techniques from several families, including Tectonic, label propagation, spectral clustering, Louvain, LambdaCC, and Infomap, on graphs with up to billions of edges. A key finding of our analysis is that motif-based graph clustering provides a good balance between output quality and efficiency. Our work provides several novel insights. We pinpoint a bias in prior works that evaluate community detection methods using only the top 5K ground-truth communities from SNAP, as these are frequently near-cliques. Our empirical studies lead to rule-of-thumb strategies for choosing thresholds, which can be critical for real applications. Finally, we show that Tectonic can fail to recover two well-separated clusters. To address this, we propose a new similarity measure based on counts of triangles and wedges (TW) that prevents the over-segmentation of communities by Tectonic.
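To make the prune-then-components pipeline described above concrete, the following is a minimal, sequential sketch in Python. It is not the authors' parallel framework, and the similarity score used here (triangles through an edge divided by the sum of its endpoint degrees) is an assumption chosen for illustration; it is neither Tectonic's exact score nor the TW measure, whose definitions are not given in this section. The function names `build_adjacency`, `edge_score`, and `prune_and_cluster` are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def edge_score(adj, u, v):
    """Illustrative similarity (an assumption, not the paper's score):
    number of triangles through (u, v) divided by deg(u) + deg(v)."""
    triangles = len(adj[u] & adj[v])
    return triangles / (len(adj[u]) + len(adj[v]))


def prune_and_cluster(edges, threshold):
    """Drop edges scoring below `threshold`, then return the connected
    components of the remaining graph as sorted lists of nodes."""
    adj = build_adjacency(edges)

    # Keep only edges whose score reaches the threshold.
    kept = {u: set() for u in adj}
    for u in adj:
        for v in adj[u]:
            if edge_score(adj, u, v) >= threshold:
                kept[u].add(v)

    # Connected components of the pruned graph via iterative DFS.
    seen = set()
    communities = []
    for start in kept:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nbr in kept[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        communities.append(sorted(component))
    return sorted(communities)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(prune_and_cluster(edges, threshold=0.1))  # [[0, 1, 2], [3, 4, 5]]
```

In this toy graph the bridge edge lies in no triangle, so its score is 0 and it is pruned, while every triangle edge scores at least 0.2 and survives. The result depends strongly on the threshold: a value of 0 keeps the whole graph as one community, and a value above 0.25 removes every edge, which illustrates why threshold selection matters in practice.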
