Table of Contents
Fetching ...

Overcoming Latency-bound Limitations of Distributed Graph Algorithms using the HPX Runtime System

Karame Mohammadiporshokooh, Panagiotis Syskakis, Andrew Lumsdaine, Hartmut Kaiser

TL;DR

This work presents a distributed library prototype and a distributed implementation of three key graph algorithms - Breadth-First Search (BFS), PageRank, and Triangle Counting - using C++ mechanisms from the NWgraph library and leveraging HPX's distributed containers and asynchronous constructs.

Abstract

Graph processing at scale presents many challenges, including the irregular structure of graphs, the latency-bound nature of graph algorithms, and the overhead associated with distributed execution. While existing frameworks such as Spark GraphX and the Parallel Boost Graph Library (PBGL) have introduced abstractions for distributed graph processing, they continue to struggle with inherent issues like load imbalance and synchronization overhead. In this work, we present a distributed library prototype and a distributed implementation of three key graph algorithms - Breadth-First Search (BFS), PageRank, and Triangle Counting, using C++ mechanisms from the NWgraph library and leveraging HPX's distributed containers and asynchronous constructs. These algorithms span the categories of Traversal, centrality, and Pattern matching, and are selected to represent diverse computational characteristics. We evaluate our HPX-based implementations against GraphX, and PBGL, showing that a high-performance runtime such as HPX enables the construction of algorithms that significantly outperform conventional frameworks by exploiting asynchronous execution, latency hiding, and fine-grained parallelism in shared memory. All algorithms in our prototype follow a unified execution model in which local and remote computations are expressed using the same programming abstractions, with asynchrony managed transparently by the runtime. This design explicitly leverages shared-memory parallelism within each locality while overlapping communication and computation across localities, providing a practical foundation for extending this approach to a broader class of distributed graph algorithms.

Overcoming Latency-bound Limitations of Distributed Graph Algorithms using the HPX Runtime System

TL;DR

This work presents a distributed library prototype and a distributed implementation of three key graph algorithms - Breadth-First Search (BFS), PageRank, and Triangle Counting - using C++ mechanisms from the NWgraph library and leveraging HPX's distributed containers and asynchronous constructs.

Abstract

Graph processing at scale presents many challenges, including the irregular structure of graphs, the latency-bound nature of graph algorithms, and the overhead associated with distributed execution. While existing frameworks such as Spark GraphX and the Parallel Boost Graph Library (PBGL) have introduced abstractions for distributed graph processing, they continue to struggle with inherent issues like load imbalance and synchronization overhead. In this work, we present a distributed library prototype and a distributed implementation of three key graph algorithms - Breadth-First Search (BFS), PageRank, and Triangle Counting, using C++ mechanisms from the NWgraph library and leveraging HPX's distributed containers and asynchronous constructs. These algorithms span the categories of Traversal, centrality, and Pattern matching, and are selected to represent diverse computational characteristics. We evaluate our HPX-based implementations against GraphX, and PBGL, showing that a high-performance runtime such as HPX enables the construction of algorithms that significantly outperform conventional frameworks by exploiting asynchronous execution, latency hiding, and fine-grained parallelism in shared memory. All algorithms in our prototype follow a unified execution model in which local and remote computations are expressed using the same programming abstractions, with asynchrony managed transparently by the runtime. This design explicitly leverages shared-memory parallelism within each locality while overlapping communication and computation across localities, providing a practical foundation for extending this approach to a broader class of distributed graph algorithms.
Paper Structure (14 sections, 2 equations, 4 figures, 1 table)

This paper contains 14 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Structure of a distributed graph. Each gray box represents a separate locality with its own address space. As is common in many distributed graph-based applications, edges connect from most localities to most others.
  • Figure 2: Strong scaling behavior of distributed graph algorithms on Erdös-Rényi uniformly random (urand) graphs.
  • Figure 3: Per-node memory usage comparison using 4 MPI processes (PBGL) / 4 worker threads (HPX) per node.
  • Figure 4: Strong scaling against Spark GraphX implementations, including GAP-kron and GAP-urand graphs ($\sim$128M vertices and $\sim$4B edges).