Performance-Driven Optimization of Parallel Breadth-First Search

Marati Bhaskar, Raghavendra Kanakagiri

TL;DR

Efficient parallel BFS on multicore systems is difficult because of irregular memory access, load imbalance, and synchronization overhead. The authors propose a suite of optimizations: BFS-NonAtomic distance updates, BFS-Hybrid (a simple top-down/bottom-up switching heuristic based on frontier size), and BFS-VisitedBitmap for visited tracking. These are evaluated across two architectures and a diverse set of graphs. Key contributions include showing that non-atomic distance updates are safe under level-synchronized BFS, a frontier-threshold switching rule, and cache-locality benefits from a bitmap visited set, with small-diameter graph speedups of $3$ to $8\times$ on one platform and $3$ to $10\times$ on the other. The findings reveal strong architecture- and graph-dependent effects, underscoring memory-access patterns as a primary performance driver and providing practical, open-source-friendly strategies for HPC graph processing.

Abstract

Breadth-first search (BFS) is a fundamental graph algorithm that presents significant challenges for parallel implementation due to irregular memory access patterns, load imbalance, and synchronization overhead. In this paper, we introduce a set of optimization strategies for parallel BFS on multicore systems, including hybrid traversal, a bitmap-based visited set, and a novel non-atomic distance update mechanism. We evaluate these optimizations across two different architectures (a 24-core Intel Xeon platform and a 128-core AMD EPYC system) using a diverse set of synthetic and real-world graphs. Our results demonstrate that the effectiveness of optimizations varies significantly based on graph characteristics and hardware architecture. For small-diameter graphs, our hybrid BFS implementation achieves speedups of $3$ to $8\times$ on the Intel platform and $3$ to $10\times$ on the AMD system compared to a conventional parallel BFS implementation. However, the performance of large-diameter graphs is more nuanced, with some of the optimizations showing varied performance across platforms, including performance degradation in some cases.
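
To make the three ideas in the abstract concrete, the following is a minimal sequential sketch, not the authors' implementation. It assumes an undirected graph stored as adjacency lists, and the names `bfs_hybrid` and `threshold_fraction` are hypothetical; the threshold value in particular is an illustrative knob, not a number taken from the paper. The sketch shows a level-synchronized loop that switches between top-down and bottom-up traversal based on frontier size and tracks visited vertices with one bit per vertex. A plausible reading of why non-atomic distance writes can be safe in this setting is that every writer reaching a vertex during the same level stores the same value, `level + 1`, so the outcome does not depend on which write lands last.

```python
def bfs_hybrid(adj, source, threshold_fraction=0.05):
    """Level-synchronized BFS with a frontier-size direction switch.

    adj: list of neighbor lists for an undirected graph.
    threshold_fraction: hypothetical tuning knob (not from the paper);
        bottom-up is used when the frontier exceeds this fraction of n.
    Returns a list of distances, with -1 for unreachable vertices.
    """
    n = len(adj)
    dist = [-1] * n
    visited = bytearray((n + 7) // 8)  # bitmap: one bit per vertex

    def is_visited(v):
        return (visited[v >> 3] >> (v & 7)) & 1

    def mark(v):
        visited[v >> 3] |= 1 << (v & 7)

    dist[source] = 0
    mark(source)
    frontier = [source]
    level = 0

    while frontier:
        next_frontier = []
        if len(frontier) > threshold_fraction * n:
            # Bottom-up: each unvisited vertex searches its neighbors
            # for a parent that sits in the current frontier.
            for v in range(n):
                if is_visited(v):
                    continue
                for u in adj[v]:
                    if dist[u] == level:  # u is in the current frontier
                        dist[v] = level + 1
                        mark(v)
                        next_frontier.append(v)
                        break
        else:
            # Top-down: each frontier vertex scans its neighbors.
            for u in frontier:
                for v in adj[u]:
                    if not is_visited(v):
                        mark(v)
                        # In a parallel version, every thread reaching v
                        # in this level would write this same value.
                        dist[v] = level + 1
                        next_frontier.append(v)
        frontier = next_frontier
        level += 1
    return dist
```

Setting `threshold_fraction` to `1.0` forces top-down traversal on every level, while a very small value forces bottom-up; both should produce identical distances, which is a convenient sanity check when experimenting with the switch point.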

Paper Structure

This paper contains 4 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Analysis of duplicates and frontier size across iterations.
  • Figure 2: Performance comparison of BFS optimizations on the SpeedCode platform.
  • Figure 3: Performance comparison of BFS optimizations on the AMD platform.