Parallel Cluster-BFS and Applications to Shortest Paths

Letong Wang, Guy Blelloch, Yan Gu, Yihan Sun

TL;DR

Running BFS from a cluster of nearby source vertices is referred to as cluster-BFS (C-BFS). This paper designs a parallel C-BFS that combines bit-level and thread-level parallelism and applies it to unweighted shortest-path distance queries.

Abstract

Breadth-first Search (BFS) is one of the most important graph processing subroutines, especially for computing unweighted distances. Many applications require running BFS from multiple sources. Sequentially, when running BFS from a cluster of nearby vertices, a known optimization is to use bit-parallelism. Given a subset of $k$ vertices in which the distance between any pair is no more than $d$, BFS can be applied to all of them in total work $O(dm(k/w+1))$, where $w$ is the length of a word in bits and $m$ is the number of edges. We will refer to this approach as cluster-BFS (C-BFS). Such an approach has been studied and shown effective both in theory and in practice in the sequential setting. However, it remains unknown how this can be combined with thread-level parallelism. In this paper, we focus on designing an efficient parallel C-BFS to answer unweighted distance queries. Our solution combines the strengths of bit-level parallelism and thread-level parallelism, and achieves significant speedup over the plain sequential solution. We also apply our algorithm to real-world applications. In particular, we identified another application (landmark-labeling for the approximate distance oracle) that can take advantage of parallel C-BFS. Under the same memory budget, our new solution improves accuracy and/or time on all the 18 tested graphs.
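To make the bit-parallel idea concrete, here is a minimal sequential sketch of a generic bit-parallel multi-source BFS. It is not the paper's C-BFS and does not attain the $O(dm(k/w+1))$ bound; it only illustrates how a single bitwise OR can advance all $k$ searches across one edge. The function name `bit_parallel_bfs`, the adjacency-list input `adj`, and the use of Python integers in place of $w$-bit machine words are assumptions made for illustration.

```python
def bit_parallel_bfs(adj, sources):
    """Hypothetical helper: BFS from all `sources` at once.

    adj[u] lists the neighbors of u; bit i of a mask stands for sources[i].
    Returns dist, where dist[v][i] is the hop distance from sources[i] to v
    (-1 if unreachable).
    """
    n, k = len(adj), len(sources)
    seen = [0] * n       # sources that have already reached each vertex
    frontier = [0] * n   # sources that reached each vertex in the last round
    dist = [[-1] * k for _ in range(n)]
    for i, s in enumerate(sources):
        seen[s] |= 1 << i
        frontier[s] |= 1 << i
        dist[s][i] = 0

    level = 0
    while any(frontier):
        level += 1
        nxt = [0] * n
        # One OR per edge pushes every active search across that edge.
        for u in range(n):
            if frontier[u]:
                for v in adj[u]:
                    nxt[v] |= frontier[u]
        for v in range(n):
            nxt[v] &= ~seen[v]   # keep only first-time arrivals
            seen[v] |= nxt[v]
            bits = nxt[v]
            while bits:          # record the level for each new arrival
                low = bits & -bits
                dist[v][low.bit_length() - 1] = level
                bits ^= low
        frontier = nxt
    return dist
```

For example, on the path graph `adj = [[1], [0, 2], [1, 3], [2]]` with `sources = [0, 1]`, each round touches every frontier edge once regardless of how many sources are active on it.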

Paper Structure

This paper contains 21 sections, 3 theorems, 2 equations, 7 figures, 7 tables, 4 algorithms.

Key Result

Corollary 3.1

On an unweighted graph, given a set $S$ of vertices with diameter no more than $d$, for any vertex $v\in V$, we have

Figures (7)

  • Figure 1: Performance comparison with existing work. We test the running time of BFSs from a cluster of 64 vertices. The baselines are Ligra \cite{shun2013ligra}, which uses only thread-level parallelism, and AIY'12 \cite{akiba2012shortest}, which uses only bit-level parallelism. The numbers are geometric means across 18 graphs. Full results are shown in \ref{table:microbenchmark} and \ref{fig:par_compare}.
  • Figure 2: Illustration of bitwise representation. The batch set $S$ is $\{A,B,C,D\}$. 4-bit bit-subsets are used to represent subsets of $S$. $\Delta_{v}$ is the smallest shortest distance from any vertex in $S$ to $v$. The subset ${S}_{v}[i]$ is defined as $\{s\in S | \delta(s,v) = \Delta_{v} +i\}$.
  • Figure 3: Speedup of parallel Ligra BFSs and parallel C-BFS over the standard sequential BFS on clusters of size 64. The $y$-axis is the speedup over sequential regular BFS in log-scale; higher is better. Each group of bars represents a graph, except the last group, which represents the average across all graphs. The numbers on the bars are the speedups of the parallel algorithms over the standard sequential algorithm.
  • Figure 4: The scalability curve of C-BFS on different numbers of processors. The $y$-axis is the self-speedup, so C-BFS running on one core is always 1. The $x$-axis is the number of cores. 96h represents 96 cores with hyperthreads.
  • Figure 5: The running time of C-BFS for various cluster diameters $d$. The $y$-axis shows the running time relative to $d=2$. The $x$-axis shows the cluster diameter $d$.
  • ...and 2 more figures
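
The bitwise representation described in the Figure 2 caption can be sketched as follows. This is an illustrative reconstruction from the caption's definitions only, not the paper's code: the helper name `bit_subsets` and its list-of-distances input are hypothetical. It assumes every source reaches $v$ and that each offset $\delta(s,v)-\Delta_v$ lies in $0..d$, which follows from the triangle inequality on an undirected graph when the sources are pairwise within distance $d$.

```python
def bit_subsets(dists_to_v, d):
    """Hypothetical helper: encode one vertex v as in the Figure 2 caption.

    dists_to_v[j] is delta(s_j, v) for the j-th source in S.
    Returns (delta_v, subsets), where delta_v is the smallest distance and
    bit j of subsets[i] is set iff delta(s_j, v) == delta_v + i.
    """
    delta_v = min(dists_to_v)
    subsets = [0] * (d + 1)
    for j, dist in enumerate(dists_to_v):
        # Assumes 0 <= dist - delta_v <= d (sources pairwise within d).
        subsets[dist - delta_v] |= 1 << j
    return delta_v, subsets
```

With $S=\{A,B,C,D\}$ mapped to bits 0 to 3 and distances `[2, 3, 2, 4]` to some vertex, `bit_subsets([2, 3, 2, 4], 2)` stores $\Delta_v = 2$ once and three 4-bit masks, rather than four separate distances.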

Theorems & Definitions (3)

  • Corollary 3.1
  • Lemma 3.1
  • Theorem 3.1