Parallelizing Maximal Clique Enumeration on GPUs

Mohammad Almasri; Yen-Hsiang Chang; Izzat El Hajj; Rakesh Nagi; Jinjun Xiong; Wen-mei Hwu

TL;DR

This work targets exact maximal clique enumeration (MCE) on GPUs, following the Bron-Kerbosch algorithm. It introduces per-block depth-first traversal of independent subtrees, a worker list for dynamic load balancing, partial induced subgraphs, and a compact two-part representation of the X sets to curb memory usage. On a single GPU, the approach outperforms the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x (up to 16.7x) and scales efficiently to multiple GPUs, while keeping the overheads of load balancing and data management low. The implementation is open source, enabling benchmarking and further research.

Abstract

We present a GPU solution for exact maximal clique enumeration (MCE) that performs a search tree traversal following the Bron-Kerbosch algorithm. Prior works on parallelizing MCE on GPUs perform a breadth-first traversal of the tree, which has limited scalability because of the explosion in the number of tree nodes at deep levels. We propose to parallelize MCE on GPUs by performing depth-first traversal of independent subtrees in parallel. Since MCE suffers from high load imbalance and memory capacity requirements, we propose a worker list for dynamic load balancing, as well as partial induced subgraphs and a compact representation of excluded vertex sets to regulate memory consumption. Our evaluation shows that our GPU implementation on a single GPU outperforms the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x (up to 16.7x), and scales efficiently to multiple GPUs. Our code has been open-sourced to enable further research on accelerating MCE.
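To make the idea of depth-first traversal of independent subtrees concrete, the following is a minimal sequential sketch in Python, not the authors' GPU implementation. It assumes a plain vertex-id ordering and decomposes the search tree at the first level only; the function names (`bron_kerbosch_pivot`, `first_level_subtrees`, `enumerate_maximal_cliques`) and the adjacency-set graph format are hypothetical choices made for illustration. Each task returned by `first_level_subtrees` carries its own P and X sets, so the tasks could in principle be handed to separate workers.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]  # assumed format: vertex -> set of neighbors


def bron_kerbosch_pivot(graph: Graph, r: Set[int], p: Set[int],
                        x: Set[int], out: List[List[int]]) -> None:
    """Depth-first Bron-Kerbosch with pivoting; appends maximal cliques to out."""
    if not p and not x:
        out.append(sorted(r))  # R cannot be extended, so it is maximal
        return
    # Pivot: the vertex of P | X with the most neighbors in P.
    pivot = max(p | x, key=lambda u: len(graph[u] & p))
    # Only vertices of P that are not neighbors of the pivot need to be branched on.
    for v in list(p - graph[pivot]):
        bron_kerbosch_pivot(graph, r | {v}, p & graph[v], x & graph[v], out)
        p.remove(v)  # v is fully explored at this level ...
        x.add(v)     # ... so later branches must exclude it


def first_level_subtrees(graph: Graph) -> List[Tuple[int, Set[int], Set[int]]]:
    """Split the search tree into one independent subtree per vertex.

    Assumption: vertices are ordered by id. For root v, P holds the
    later neighbors and X the earlier neighbors.
    """
    tasks = []
    for v in sorted(graph):
        p = {u for u in graph[v] if u > v}
        x = {u for u in graph[v] if u < v}
        tasks.append((v, p, x))
    return tasks


def enumerate_maximal_cliques(graph: Graph) -> List[List[int]]:
    cliques: List[List[int]] = []
    # Each task is self-contained; here the tasks simply run one after another.
    for v, p, x in first_level_subtrees(graph):
        bron_kerbosch_pivot(graph, {v}, p, x, cliques)
    return sorted(cliques)


# Triangle 0-1-2 with a path 2-3-4 attached.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(enumerate_maximal_cliques(example))  # [[0, 1, 2], [2, 3], [3, 4]]
```

Because every subtree starts from its own copy of P and X, the order in which the tasks run does not affect the result. That independence is what lets subtrees be traversed in parallel, and the uneven sizes of the subtrees are the source of the load imbalance that the abstract mentions.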

Paper Structure (30 sections, 5 figures, 4 tables, 4 algorithms)

Introduction
Background
Maximal Clique Enumeration
Bron-Kerbosch
Bron-Kerbosch with Pivoting
Bron-Kerbosch with Other Optimizations
Parallelizing MCE on GPUs
Challenges and Implementation Overview
Independent Second-level Subtrees
Dynamic Load Balancing with a Worker List
Partial Induced Subgraphs
Compact Representation of the X Sets
Evaluation
Methodology
Performance
...and 15 more sections

Figures (5)

Figure 1: Bron-Kerbosch algorithm variants applied to the example graph
Figure 2: Using a single array to represent $X_X$ across levels
Figure 3: Load distribution across streaming multiprocessors (SMs) for different combinations of optimizations
Figure 4: Strong scaling with respect to the number of GPUs for different combinations of optimizations
Figure 5: Breakdown and comparison of execution time for different combinations of optimizations
