
$ν$-LPA: Fast GPU-based Label Propagation Algorithm (LPA) for Community Detection

Subhajit Sahu

TL;DR

This report presents an optimized implementation of the Label Propagation Algorithm for community detection, featuring an asynchronous LPA with a Pick-Less method every 4 iterations to handle community swaps, ideal for SIMT hardware like GPUs.

Abstract

Community detection is the problem of identifying natural divisions in networks. Efficient parallel algorithms for identifying such divisions are critical in a number of applications. This report presents an optimized implementation of the Label Propagation Algorithm (LPA) for community detection, featuring an asynchronous LPA with a Pick-Less (PL) method every 4 iterations to handle community swaps, ideal for SIMT hardware like GPUs. It also introduces a novel per-vertex hashtable with hybrid quadratic-double probing for collision resolution. On an NVIDIA A100 GPU, our implementation, $ν$-LPA, outperforms FLPA (sequential), NetworKit LPA (multicore), Gunrock LPA (GPU), and cuGraph Louvain (GPU) by 364x, 62x, 2.6x, and 37x, respectively; FLPA and NetworKit LPA were run on a server with dual 16-core Intel Xeon Gold 6226R processors. $ν$-LPA processes 3.0B edges/s on a 2.2B-edge graph and achieves 4.7% higher modularity than FLPA, but 6.1% and 9.6% lower modularity than NetworKit LPA and cuGraph Louvain, respectively.
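
To make the Pick-Less idea concrete, the following is a minimal sequential sketch in Python, not the report's GPU implementation. It assumes that in a Pick-Less iteration a vertex may adopt a new label only if that label's ID is lower than its current one, and that Pick-Less iterations are those whose index is a multiple of 4. The names `lpa_pick_less` and `two_cliques`, the tie-breaking rule (smallest label wins), and the example graph are hypothetical choices made for illustration.

```python
from collections import defaultdict


def lpa_pick_less(adj, max_iters=20, pl_period=4):
    """Asynchronous LPA sketch. adj maps vertex -> list of (neighbor, weight)."""
    labels = {v: v for v in adj}  # every vertex starts in its own community
    for it in range(max_iters):
        # Assumption: Pick-Less is applied on iterations 0, 4, 8, ...
        pick_less = (it % pl_period == 0)
        changed = 0
        for v in adj:  # asynchronous: label updates are visible immediately
            weights = defaultdict(float)
            for u, w in adj[v]:
                if u != v:
                    weights[labels[u]] += w
            if not weights:
                continue
            # Label with the highest total weight; ties go to the smallest ID.
            best = min(weights, key=lambda c: (-weights[c], c))
            if best == labels[v]:
                continue
            # Pick-Less: refuse moves to a label with a higher (or equal) ID.
            if pick_less and best >= labels[v]:
                continue
            labels[v] = best
            changed += 1
        # Only an unrestricted iteration with no moves counts as convergence.
        if changed == 0 and not pick_less:
            break
    return labels


def two_cliques():
    """Hypothetical test graph: two 4-cliques joined by one weak edge (3, 4)."""
    adj = {v: [] for v in range(8)}

    def add_edge(u, v, w):
        adj[u].append((v, w))
        adj[v].append((u, w))

    for group in ((0, 1, 2, 3), (4, 5, 6, 7)):
        for a in group:
            for b in group:
                if a < b:
                    add_edge(a, b, 1.0)
    add_edge(3, 4, 0.5)
    return adj
```

Calling `lpa_pick_less(two_cliques())` groups each clique under its lowest vertex ID. In a parallel setting, the restriction is meant to stop two neighboring vertices from repeatedly exchanging labels, since only one direction of the exchange is permitted in a Pick-Less iteration.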

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Relative Runtime and Modularity of obtained communities with different community swap prevention techniques --- these include cross-checking and reverting bad community swaps (CC) every $1$ to $4$ iterations, enforcing picking/selection of only a label with a lower ID value (PL) every $1$ to $4$ iterations, and a hybrid of the two techniques (H) performed every $1$ to $4$ iterations.
  • Figure 2: Illustration of per-vertex open-addressing hashtables for our GPU implementation of LPA. Each vertex $i$ has a hashtable $H$ with a keys array $H_k$ and a values array $H_v$. Memory for all vertices' hash key and value arrays is allocated together. The hashtable's offset for vertex $i$ is $2O_i$, where $O_i$ is its CSR offset. The memory reserved for the hashtable is $2D_i$, with $D_i$ being the vertex's degree. The hashtable's capacity, or maximum key-value pairs, is $nextPow2(D_i) - 1$.
  • Figure 3: Relative Runtime with using Linear probing, Quadratic probing, Double hashing, and a hybrid of Quadratic probing and Double hashing (Quadratic-double) for collision resolution in the per-vertex hashtables.
  • Figure 4: Relative Runtime with various switch degrees, i.e., switching point by vertex degree between the thread-per-vertex kernel vs. block-per-vertex kernel, ranging from $2$ to $256$. Vertices with degree lower than the switch degree are processed by the thread-per-vertex kernel, while the remaining vertices are processed by the block-per-vertex kernel (the vertices are partitioned accordingly).
  • Figure 5: Relative Runtime with using 32-bit floating point values (Float) compared to 64-bit floating point values (Double) for the aggregated weights (values) in the hashtable.
  • ...and 1 more figure
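
The per-vertex hashtable layout described in the Figure 2 caption can be sketched on the CPU as follows. This is an illustrative Python sketch under stated assumptions, not the report's CUDA code: `next_pow2` is assumed to return the smallest power of two strictly greater than its argument (so that the capacity is at least the degree), the probe-step rule (double the step, then add a key-dependent term) is only one plausible reading of "hybrid quadratic-double probing", and the names `VertexTables`, `EMPTY`, and `P2` are hypothetical. A linear-scan fallback is added so that insertion always succeeds when a free slot exists.

```python
EMPTY = -1  # hypothetical sentinel for an unused key slot
P2 = 7      # hypothetical secondary modulus for the key-dependent step term


def next_pow2(x):
    """Smallest power of two strictly greater than x (assumed convention)."""
    p = 1
    while p <= x:
        p *= 2
    return p


class VertexTables:
    """Key and value arrays for all vertices, allocated together."""

    def __init__(self, offsets):
        self.offsets = offsets          # CSR offsets, length n + 1
        total = 2 * offsets[-1]         # 2 * D_i slots reserved per vertex
        self.keys = [EMPTY] * total
        self.vals = [0.0] * total

    def region(self, i):
        """Return (start offset 2 * O_i, capacity nextPow2(D_i) - 1)."""
        degree = self.offsets[i + 1] - self.offsets[i]
        return 2 * self.offsets[i], next_pow2(degree) - 1

    def accumulate(self, i, key, weight):
        """Add weight to label `key` in the hashtable of vertex i."""
        base, cap = self.region(i)
        if cap == 0:
            raise ValueError("vertex has no neighbors")
        slot, step = key % cap, 1
        for _ in range(cap):
            j = base + slot
            if self.keys[j] in (EMPTY, key):
                self.keys[j] = key
                self.vals[j] += weight
                return
            # Assumed hybrid rule: grow the step and add a key-dependent term.
            step = 2 * step + key % P2
            slot = (slot + step) % cap
        for j in range(base, base + cap):  # fallback: plain linear scan
            if self.keys[j] in (EMPTY, key):
                self.keys[j] = key
                self.vals[j] += weight
                return
        raise RuntimeError("hashtable full")

    def best_label(self, i):
        """Return (label, weight) with the largest aggregated weight."""
        base, cap = self.region(i)
        pairs = [(self.keys[j], self.vals[j])
                 for j in range(base, base + cap) if self.keys[j] != EMPTY]
        return max(pairs, key=lambda kv: kv[1])
```

For example, with `offsets = [0, 3, 5]` (degrees 3 and 2), vertex 1's table starts at slot 6 and both tables have capacity 3. Because the capacity never exceeds `2 * D_i - 1`, each table stays inside its reserved `2 * D_i` slots and cannot overlap with the next vertex's table.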