High-Performance Parallelization of Dijkstra's Algorithm Using MPI and CUDA

Boyang Song

TL;DR

This work accelerates shortest-path computation on large graphs by implementing and comparing three versions of Dijkstra's algorithm: a serial baseline, an MPI-based parallel version, and a CUDA-based parallel version, all using a common adjacency-matrix representation. The study reports speedups over the serial implementation of more than $5\times$ for MPI and more than $10\times$ for CUDA, while highlighting two persistent challenges: synchronization overhead and the memory cost of adjacency matrices for large-scale graphs. It evaluates performance across a range of graph sizes and densities, noting that communication and load-balancing limitations constrain scalability in MPI, whereas GPU parallelism offers strong gains when data transfer and kernel efficiency are optimized. The results provide practical guidance for HPC implementations of parallel shortest-path computation and emphasize the trade-offs between CPU-based MPI and GPU-based CUDA for graph analytics.

Abstract

This paper investigates the parallelization of Dijkstra's algorithm for computing shortest paths in large-scale graphs using MPI and CUDA. The primary hypothesis is that parallel computing can significantly reduce computation time compared to a serial implementation. To validate this, I implemented three versions of the algorithm: a serial version, an MPI-based parallel version, and a CUDA-based parallel version. Experimental results demonstrate that the MPI implementation achieves a speedup of over $5\times$, while the CUDA implementation attains a speedup of more than $10\times$ relative to the serial benchmark. However, the study also reveals inherent challenges in parallelizing Dijkstra's algorithm, including its sequential logic and significant synchronization overhead. Furthermore, the use of an adjacency matrix as the data structure is examined, highlighting its impact on memory consumption and performance in both dense and sparse graphs.
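
To make the "sequential logic" and "synchronization overhead" mentioned above concrete, the following is a minimal Python sketch of adjacency-matrix Dijkstra, not the paper's implementation. `dijkstra_dense` is the usual $O(V^2)$ serial form. `dijkstra_partitioned` is a hypothetical illustration of one common decomposition, in which vertices are split into contiguous blocks per worker; it runs in a single process and only simulates the per-iteration global minimum that a distributed version would have to compute. The function names, the `INF` sentinel for "no edge", and the block partitioning are all assumptions of this sketch.

```python
import math

INF = math.inf  # assumption of this sketch: INF in the matrix means "no edge"


def dijkstra_dense(adj, source):
    """Serial O(V^2) Dijkstra over an adjacency matrix (list of lists)."""
    n = len(adj)
    dist = [INF] * n
    visited = [False] * n
    dist[source] = 0
    for _ in range(n):
        # Step 1: select the closest unvisited vertex (a global minimum).
        u, best = -1, INF
        for v in range(n):
            if not visited[v] and dist[v] < best:
                best, u = dist[v], v
        if u == -1:  # every remaining vertex is unreachable
            break
        visited[u] = True
        # Step 2: relax the edges leaving u; each v is independent of the others.
        for v in range(n):
            if not visited[v] and dist[u] + adj[u][v] < dist[v]:
                dist[v] = dist[u] + adj[u][v]
    return dist


def dijkstra_partitioned(adj, source, num_workers):
    """Same algorithm with vertices split into contiguous blocks, one per
    hypothetical worker. Taking the min over local candidates stands in for
    the global reduction a distributed version needs in every iteration."""
    n = len(adj)
    dist = [INF] * n
    visited = [False] * n
    dist[source] = 0
    chunk = (n + num_workers - 1) // num_workers
    blocks = [range(w * chunk, min(n, (w + 1) * chunk)) for w in range(num_workers)]
    for _ in range(n):
        # Each worker scans only its own block for a local minimum.
        local = []
        for block in blocks:
            cand = [(dist[v], v) for v in block if not visited[v]]
            local.append(min(cand) if cand else (INF, -1))
        # Synchronization point: combine local minima into the global minimum.
        d_u, u = min(local)
        if d_u == INF:
            break
        visited[u] = True
        # Each worker relaxes only the vertices it owns.
        for block in blocks:
            for v in block:
                if not visited[v] and d_u + adj[u][v] < dist[v]:
                    dist[v] = d_u + adj[u][v]
    return dist


# Small undirected example: edges 0-1 (4), 0-2 (1), 1-2 (2), 1-3 (5).
graph = [
    [0, 4, 1, INF],
    [4, 0, 2, 5],
    [1, 2, 0, INF],
    [INF, 5, INF, 0],
]
print(dijkstra_dense(graph, 0))           # [0, 3, 1, 8]
print(dijkstra_partitioned(graph, 0, 2))  # [0, 3, 1, 8]
```

The outer loop cannot be parallelized because each iteration depends on the vertex chosen in the previous one; only the scan and the relaxation inside an iteration can be divided among workers, which is why every iteration ends in a synchronization step. The matrix itself grows as $V^2$: assuming 4-byte edge weights, a graph with $10^5$ vertices needs about $4 \times 10^{10}$ bytes (40 GB) regardless of how few edges it has.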

Paper Structure

This paper contains 19 sections, 4 equations, 6 figures, 4 tables, 7 algorithms.

Figures (6)

  • Figure 1: Undirected graph
  • Figure 2: Performance (linear scale)
  • Figure 3: Performance (log scale)
  • Figure 4: Performance (linear scale)
  • Figure 5: Performance (log scale)
  • ...and 1 more figure