Parallelizing Maximal Clique Enumeration on GPUs

Mohammad Almasri, Yen-Hsiang Chang, Izzat El Hajj, Rakesh Nagi, Jinjun Xiong, Wen-mei Hwu

TL;DR

This work parallelizes exact maximal clique enumeration (MCE) on GPUs using the Bron-Kerbosch algorithm. It introduces per-block depth-first traversal of independent subtrees, a worker list for dynamic load balancing, partial induced subgraphs, and a compact two-part representation of the excluded vertex (X) sets to curb memory usage. On a single GPU, the implementation outperforms the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x (up to 16.7x), and it scales efficiently to multiple GPUs while keeping the overheads of load balancing and data management low. The code is open source, enabling benchmarking and further research on exact MCE at scale.

Abstract

We present a GPU solution for exact maximal clique enumeration (MCE) that performs a search tree traversal following the Bron-Kerbosch algorithm. Prior works on parallelizing MCE on GPUs perform a breadth-first traversal of the tree, which has limited scalability because of the explosion in the number of tree nodes at deep levels. We propose to parallelize MCE on GPUs by performing depth-first traversal of independent subtrees in parallel. Since MCE suffers from high load imbalance and memory capacity requirements, we propose a worker list for dynamic load balancing, as well as partial induced subgraphs and a compact representation of excluded vertex sets to regulate memory consumption. Our evaluation shows that our GPU implementation on a single GPU outperforms the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x (up to 16.7x), and scales efficiently to multiple GPUs. Our code has been open-sourced to enable further research on accelerating MCE.
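
To make the traversal described in the abstract concrete, the following is a minimal sequential Python sketch of Bron-Kerbosch with pivoting, where the search tree is split into one independent subtree per vertex. It is not the authors' GPU implementation: the function names (`bron_kerbosch_pivot`, `first_level_subtrees`, `enumerate_maximal_cliques`), the vertex-id ordering, and the pivot rule are assumptions chosen for illustration, and the worker list, partial induced subgraphs, and compact X representation are not modeled.

```python
from typing import Dict, Iterator, List, Set, Tuple

Graph = Dict[int, Set[int]]  # adjacency sets of an undirected graph


def bron_kerbosch_pivot(graph: Graph, r: Set[int], p: Set[int],
                        x: Set[int], out: List[List[int]]) -> None:
    """Depth-first Bron-Kerbosch with pivoting.

    r: current clique, p: candidate vertices, x: excluded vertices.
    Appends each maximal clique (as a sorted list) to `out`.
    """
    if not p and not x:
        out.append(sorted(r))  # nothing can extend r, so r is maximal
        return
    # Assumed pivot rule: the vertex of P ∪ X with the most neighbors in P.
    pivot = max(p | x, key=lambda u: len(graph[u] & p))
    for v in list(p - graph[pivot]):
        bron_kerbosch_pivot(graph, r | {v}, p & graph[v], x & graph[v], out)
        p.remove(v)  # v is fully explored: move it from P to X
        x.add(v)


def first_level_subtrees(graph: Graph) -> Iterator[Tuple[Set[int], Set[int], Set[int]]]:
    """Yield one independent (R, P, X) subtree per vertex.

    Assumes plain vertex-id order: later neighbors form P, earlier ones form X,
    so each maximal clique is reported only from its lowest-id vertex.
    """
    for v in sorted(graph):
        later = {u for u in graph[v] if u > v}
        earlier = {u for u in graph[v] if u < v}
        yield {v}, later, earlier


def enumerate_maximal_cliques(graph: Graph) -> List[List[int]]:
    """Traverse every first-level subtree depth-first, one after another."""
    cliques: List[List[int]] = []
    for r, p, x in first_level_subtrees(graph):
        bron_kerbosch_pivot(graph, r, p, x, cliques)
    return sorted(cliques)


if __name__ == "__main__":
    # A triangle 0-1-2 with a path 2-3-4 attached.
    example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
    print(enumerate_maximal_cliques(example))  # [[0, 1, 2], [2, 3], [3, 4]]
```

Because the subtrees produced by `first_level_subtrees` share no state, they can in principle be processed concurrently; in the setting the abstract describes, each such subtree would be traversed depth-first in parallel rather than in the sequential loop shown here.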

Paper Structure

This paper contains 30 sections, 5 figures, 4 tables, 4 algorithms.

Figures (5)

  • Figure 1: Bron-Kerbosch algorithm variants applied to the example graph
  • Figure 2: Using a single array to represent $X_X$ across levels
  • Figure 3: Load distribution across streaming multiprocessors (SMs) for different combinations of optimizations
  • Figure 4: Strong scaling with respect to the number of GPUs for different combinations of optimizations
  • Figure 5: Breakdown and comparison of execution time for different combinations of optimizations