CPU vs. GPU for Community Detection: Performance Insights from GVE-Louvain and $ν$-Louvain

Subhajit Sahu

TL;DR

The paper introduces GVE-Louvain, a highly optimized multicore CPU implementation of the Louvain method for community detection, and $ν$-Louvain, a GPU-based variant. GVE-Louvain is 5.8x to 50x faster than state-of-the-art CPU and GPU baselines (Vite, Grappolo, NetworKit Louvain, and cuGraph Louvain), reaches 560M edges per second on a 3.8B-edge graph, and improves performance by 1.6x for every doubling of threads. $ν$-Louvain performs only on par with GVE-Louvain, largely because workload and parallelism diminish in the algorithm’s later passes, which highlights the advantage of CPU flexibility for irregular workloads. Overall, the findings suggest CPUs offer superior practicality and energy efficiency for large-scale community detection tasks, though carefully designed GPU approaches can still be effective for the early, highly parallel phases.

Abstract

Community detection involves identifying natural divisions in networks, a crucial task for many large-scale applications. This report presents GVE-Louvain, one of the most efficient multicore implementations of the Louvain algorithm, a high-quality method for community detection. Running on a dual 16-core Intel Xeon Gold 6226R server, GVE-Louvain outperforms Vite, Grappolo, NetworKit Louvain, and cuGraph Louvain (on an NVIDIA A100 GPU) by factors of 50x, 22x, 20x, and 5.8x, respectively, achieving a processing rate of 560M edges per second on a 3.8B-edge graph. Additionally, it scales efficiently, improving performance by 1.6x for every thread doubling. The paper also presents $ν$-Louvain, a GPU-based implementation. When evaluated on an NVIDIA A100 GPU, $ν$-Louvain performs only on par with GVE-Louvain, largely due to reduced workload and parallelism in later algorithmic passes. These results suggest that CPUs, with their flexibility in handling irregular workloads, may be better suited for community detection tasks.
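The throughput and scaling figures quoted above can be turned into rough back-of-the-envelope numbers. The sketch below is only illustrative: it assumes the processing rate is simply edges divided by runtime, that the 1.6x-per-doubling factor holds uniformly from 1 thread upward, and that 32 threads (one per core on the dual 16-core server) is the relevant thread count. None of these assumptions is stated in the abstract, and the function names are hypothetical.

```python
import math


def implied_runtime_seconds(num_edges, edges_per_second):
    """Runtime implied by a processing rate, assuming rate = edges / runtime."""
    return num_edges / edges_per_second


def projected_speedup(num_threads, gain_per_doubling=1.6):
    """Speedup over 1 thread, assuming the same gain at every thread doubling."""
    doublings = math.log2(num_threads)
    return gain_per_doubling ** doublings


if __name__ == "__main__":
    runtime = implied_runtime_seconds(3.8e9, 560e6)
    speedup = projected_speedup(32)  # assumption: 32 threads, one per core
    print(f"implied runtime on the 3.8B-edge graph: {runtime:.1f} s")
    print(f"projected speedup at 32 threads: {speedup:.1f}x")
    print(f"projected parallel efficiency: {speedup / 32:.2f}")
```

Under these assumptions, the 3.8B-edge graph would be processed in under 7 seconds, and five thread doublings would compound to roughly a 10x speedup over a single thread.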

Paper Structure

This paper contains 48 sections, 2 equations, 17 figures, 2 tables, and 7 algorithms.

Figures (17)

  • Figure 1: Illustration of the vertex pruning optimization: After vertex $1$ is processed, it is unmarked. If vertex $1$ changes its community, its neighbors are marked for processing. Community membership of each vertex is depicted by border color, and marked vertices are highlighted in yellow.
  • Figure 2: Impact of various parameter controls and optimizations on the runtime and result quality (modularity) of the Louvain algorithm, with the left/right Y-axes showing the effect of each optimization on relative runtime/modularity, respectively.
  • Figure 3: Illustration of collision-free per-thread hashtables that are well separated in their memory addresses (Far-KV), for two threads. Each hashtable consists of a keys vector, a values vector of size $|V|$, and a key count ($N_0$ or $N_1$). The value associated with each key is stored or accumulated at the index pointed to by the key. To prevent false sharing of cache lines, the key count for each hashtable is independently updated and allocated separately on the heap. These hashtables are utilized during the local-moving and aggregation phases of our multicore Louvain implementation, GVE-Louvain.
  • Figure 4: A flow diagram illustrating the first pass of GVE-Louvain for a Weighted 2D-vector based or a Weighted CSR with degree based input graph. In the local-moving phase, vertex community memberships are updated until the total change in delta-modularity across all vertices reaches a specified threshold. Community memberships are then counted and renumbered. In the aggregation phase, community vertices in a CSR are first obtained. This is used to create the super-vertex graph stored in a Weighted Holey CSR with degree. In subsequent passes, the input is a Weighted Holey CSR with degree and initial community membership for super-vertices from the previous pass.
  • Figure 5: Relative Runtime and Modularity of the communities obtained using the Pick-Less (PL) community swap prevention technique, which restricts label selection to those with lower ID values every $2$ to $16$ iterations.
  • ...and 12 more figures
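The captions for Figures 1, 3, and 4 describe three ideas that fit together in the local-moving phase: vertex pruning, a collision-free hashtable indexed directly by community ID, and iteration until the total delta-modularity gain falls to a threshold. The following is a minimal single-threaded sketch of how they might combine, not the authors' implementation. Assumptions: the class and function names (`FarKVTable`, `local_moving`, `two_triangles`) are hypothetical, edge weights are strictly positive (so a zero value marks an unused slot), the standard delta-modularity formula for moving a vertex between communities is used, and the tolerance and iteration cap are illustrative defaults. In a multicore setting each thread would own its own table.

```python
class FarKVTable:
    """Collision-free accumulator: values[key] holds the running sum for key."""

    def __init__(self, num_vertices):
        self.keys = []                      # keys touched so far; len(keys) is the key count
        self.values = [0.0] * num_vertices  # one slot per possible community ID

    def accumulate(self, key, weight):
        # Assumption: weights are strictly positive, so 0.0 means "slot unused".
        if self.values[key] == 0.0:
            self.keys.append(key)
        self.values[key] += weight

    def clear(self):
        # Reset only the touched slots instead of the whole values vector.
        for key in self.keys:
            self.values[key] = 0.0
        self.keys.clear()


def local_moving(adj, tolerance=1e-2, max_iterations=20):
    """adj[u] is a list of (neighbor, weight); each undirected edge is stored twice."""
    n = len(adj)
    degree = [sum(w for _, w in adj[u]) for u in range(n)]  # weighted degree K_u
    m = sum(degree) / 2.0                                   # total edge weight
    community = list(range(n))                              # each vertex starts alone
    sigma = degree[:]                                       # total degree per community
    marked = [True] * n                                     # all vertices start marked
    table = FarKVTable(n)
    for _ in range(max_iterations):
        total_gain = 0.0
        for u in range(n):
            if not marked[u]:
                continue           # vertex pruning: skip unmarked vertices
            marked[u] = False      # unmark u once it is processed
            table.clear()
            for v, w in adj[u]:
                if v != u:
                    table.accumulate(community[v], w)  # weight from u to each community
            d = community[u]
            k_to_d = table.values[d]
            best_c, best_gain = d, 0.0
            for c in table.keys:
                if c == d:
                    continue
                # Delta-modularity of moving u from community d to community c.
                gain = (table.values[c] - k_to_d) / m \
                    - degree[u] * (degree[u] + sigma[c] - sigma[d]) / (2.0 * m * m)
                if gain > best_gain:
                    best_c, best_gain = c, gain
            if best_c != d:
                sigma[d] -= degree[u]
                sigma[best_c] += degree[u]
                community[u] = best_c
                total_gain += best_gain
                for v, _ in adj[u]:
                    marked[v] = True   # u moved, so its neighbors must be revisited
        if total_gain <= tolerance:
            break
    return community


def two_triangles():
    """Two unit-weight triangles (0,1,2) and (3,4,5) joined by the edge 2-3."""
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    adj = [[] for _ in range(6)]
    for u, v in edges:
        adj[u].append((v, 1.0))
        adj[v].append((u, 1.0))
    return adj


if __name__ == "__main__":
    print(local_moving(two_triangles()))
```

On the small two-triangle graph, each triangle ends up in its own community. Because only touched slots are reset in `clear`, the cost of reusing the table between vertices is proportional to the number of neighboring communities rather than to $|V|$.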