Large-Scale Graphs Community Detection using Spark GraphFrames
Elena-Simona Apostol, Adrian-Cosmin Cojocaru, Ciprian-Octavian Truică
TL;DR
This paper addresses scalable community detection on large graphs with a framework built on Apache Spark GraphFrames. The framework implements three algorithms (Louvain, Fast Greedy, and K-Cliques) in GraphFrames and evaluates them on Twitter and Research Collaborations datasets. The reported results show near-linear scalability and competitive running times, with Fast Greedy typically faster than Louvain and K-Cliques supporting overlapping communities. Louvain and Fast Greedy optimize the modularity functions $Q$ (undirected graphs) and $Q_d$ (directed graphs). The work supports the practicality of distributed graph mining on GraphFrames and points to future directions such as Infomap and Infomod to broaden the analysis toolbox.
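To make the modularity function $Q$ concrete, here is a minimal single-machine sketch in plain Python. It assumes the standard Newman definition for undirected, unweighted graphs, $Q = \sum_c \left[ L_c/m - (d_c/2m)^2 \right]$, where $m$ is the number of edges, $L_c$ the number of edges inside community $c$, and $d_c$ the total degree of its nodes. The function name `modularity` and the toy graph are illustrative only. They are not taken from the paper, and the sketch does not use Spark or GraphFrames.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph without self-loops.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0

    degree = defaultdict(int)    # node -> degree
    internal = defaultdict(int)  # community -> number of intra-community edges
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1

    total_degree = defaultdict(int)  # community -> sum of member degrees
    for node, k in degree.items():
        total_degree[community[node]] += k

    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(
        internal[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {node: "all" for node in range(6)}

q_split = modularity(edges, two_communities)   # 5/14, about 0.357
q_merged = modularity(edges, one_community)    # 0.0
```

Splitting the toy graph into its two triangles scores higher than placing every node in one community, which is the kind of gain that modularity-optimizing methods such as Louvain and Fast Greedy search for. A distributed implementation would compute the same per-community sums with aggregations over edge and vertex tables rather than in-memory dictionaries.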
Abstract
With the emergence of social networks, online platforms dedicated to different use cases, and sensor networks, large-scale graph community detection has become a steady field of research with real-world applications. Community detection algorithms have numerous practical applications, particularly due to their scalability with data size. Nonetheless, a notable drawback of community detection algorithms is their computational intensity~\cite{Apostol2014}, resulting in decreasing performance as data size increases. To address this, new frameworks must be developed that employ distributed systems such as Apache Hadoop and Apache Spark, which can seamlessly handle large-scale graphs. In this paper, we propose a novel framework for community detection algorithms, i.e., K-Cliques, Louvain, and Fast Greedy, developed using Apache Spark GraphFrames. We test their performance and scalability on two real-world datasets. The experimental results demonstrate the feasibility of developing graph mining algorithms using Apache Spark GraphFrames.
