Memory Efficient GPU-based Label Propagation Algorithm (LPA) for Community Detection on Large Graphs

Subhajit Sahu

TL;DR

The paper addresses the high memory demands of GPU-based LPA for large graphs by introducing memory-efficient variants that replace per-thread hashtables with weighted MG and BM sketches. The proposed νMG8-LPA and νBM-LPA reduce working-set size to achieve $O(|V|)$ space while maintaining competitive time complexity $O(K|E|)$ and acceptable modularity losses. Across large SuiteSparse graphs, these methods dramatically cut memory usage (up to $98\times$) and deliver substantial speedups relative to prior GPU/CPU LPA implementations, with νMG8-LPA showing strong performance on web and social graphs. This work enables scalable, memory-conscious community detection on shared-memory GPUs, with practical implications for handling graphs with billions of edges.

Abstract

Community detection involves grouping the nodes of a graph so that connections are denser within groups than between them. We previously proposed efficient multicore (GVE-LPA) and GPU-based ($ν$-LPA) implementations of the Label Propagation Algorithm (LPA) for community detection. However, these methods incur high memory overhead due to their per-thread/per-vertex hashtables, which makes it challenging to process large graphs on shared-memory systems. In this report, we introduce memory-efficient GPU-based LPA implementations that use weighted Boyer-Moore (BM) and Misra-Gries (MG) sketches. Our new implementation, $ν$MG8-LPA, which uses an 8-slot MG sketch, reduces memory usage by 98x and 44x compared to GVE-LPA and $ν$-LPA, respectively. It is also 2.4x faster than GVE-LPA and only 1.1x slower than $ν$-LPA, with minimal quality loss (a 4.7%/2.9% drop in modularity compared to GVE-LPA/$ν$-LPA).
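To make the sketch idea concrete, here is a minimal sequential Python sketch of how a weighted Misra-Gries summary might pick a vertex's new label from its neighbors' labels. It is an illustration only, not the paper's GPU kernel: the names `mg_add` and `pick_label` are hypothetical, and the eviction rule used (subtracting the smaller of the incoming weight and the minimum slot weight from every slot) is one common weighted generalization of Misra-Gries that is assumed here, not taken from the paper.

```python
def mg_add(sketch, label, weight, k):
    """Fold one (label, weight) pair into a weighted Misra-Gries sketch.

    `sketch` is a dict with at most k entries (label -> accumulated weight).
    Assumes positive weights. The eviction rule is an assumed, textbook-style
    weighted generalization, not necessarily the rule used in the paper.
    """
    if label in sketch:
        # Label already tracked: accumulate its weight.
        sketch[label] += weight
    elif len(sketch) < k:
        # A free slot exists: start tracking this label.
        sketch[label] = weight
    else:
        # Sketch is full: decrement every slot by d and drop emptied slots.
        d = min(weight, min(sketch.values()))
        for key in list(sketch):
            sketch[key] -= d
            if sketch[key] <= 0:
                del sketch[key]
        # Any weight left over claims one of the freed slots.
        if weight > d:
            sketch[label] = weight - d


def pick_label(neighbors, k=8):
    """Return the heaviest label surviving in a k-slot sketch, or None."""
    sketch = {}
    for label, weight in neighbors:
        mg_add(sketch, label, weight, k)
    if not sketch:
        return None
    return max(sketch, key=sketch.get)


# Neighbor labels with edge weights for one hypothetical vertex.
neighbors = [(1, 3), (2, 1), (3, 1), (1, 2)]
print(pick_label(neighbors, k=2))
```

The point of the design is that each vertex needs only `k` slots regardless of its degree, in contrast to a hashtable sized by the degree. With `k = 1` the same update rule reduces to a weighted Boyer-Moore majority vote.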

Paper Structure

This paper contains 31 sections, 3 equations, 7 figures, 1 table, 5 algorithms.

Figures (7)

  • Figure 1: Illustration of per-vertex open-addressing hashtables in $\nu$-LPA [sahu2024nulpa]. Each vertex $i$ has a hashtable $H$ with a key array $H_k$ and a value array $H_v$. Memory for all hash key and value arrays is allocated together. The offset for vertex $i$'s hashtable is $2O_i$, where $O_i$ is its CSR offset. The total memory for the hashtable is $2D_i$, where $D_i$ is the vertex's degree. The hashtable’s capacity is $\text{nextPow2}(D_i) - 1$.
  • Figure 2: Relative runtime and Modularity of obtained communities using $\nu$MG-LPA, with varying number of slots $k$ in the Misra-Gries (MG) sketch, ranging from $2$ to $32$.
  • Figure 3: Relative Runtime of Shared variables and Warp-vote approaches for populating weighted Misra-Gries (MG) sketches from the neighborhood of each vertex.
  • Figure 4: Relative Runtime of Shared sketch and Partial sketches approaches for populating weighted Misra-Gries (MG) sketches from the neighborhood of each vertex.
  • Figure 5: Relative Runtime of Single scan vs. Double scan approaches for selecting the updated label of each vertex.
  • ...and 2 more figures
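The per-vertex hashtable layout described in the Figure 1 caption can be sketched as simple index arithmetic. The following Python snippet is an illustration under stated assumptions: the helper names `next_pow2` and `hashtable_layout` are hypothetical, and `nextPow2(x)` is assumed to mean the smallest power of two strictly greater than `x`, an interpretation chosen because it keeps the capacity at least $D_i$ and below the reserved $2D_i$ entries. The caption itself does not define it.

```python
def next_pow2(x):
    """Smallest power of two strictly greater than x (assumed semantics)."""
    return 1 << x.bit_length()


def hashtable_layout(csr_offsets, i):
    """Offset, reserved size, and capacity of vertex i's hashtable.

    `csr_offsets` is the usual CSR offsets array of length |V| + 1, so
    vertex i has offset O_i = csr_offsets[i] and degree
    D_i = csr_offsets[i + 1] - csr_offsets[i].
    """
    o = csr_offsets[i]
    d = csr_offsets[i + 1] - o
    return {
        "offset": 2 * o,               # 2 * O_i, per the caption
        "reserved": 2 * d,             # 2 * D_i entries reserved
        "capacity": next_pow2(d) - 1,  # nextPow2(D_i) - 1 usable slots
    }


# A hypothetical 3-vertex graph with degrees 3, 4, and 1.
csr_offsets = [0, 3, 7, 8]
print(hashtable_layout(csr_offsets, 1))
```

For vertex 1 in this example (offset 3, degree 4), the layout is offset 6, 8 reserved entries, and capacity 7. Because the reserved space grows with the degree, the total across all vertices grows with $|E|$, which is the overhead the sketch-based variants aim to avoid.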