Beyond a Single Queue: Multi-Level-Multi-Queue as an Effective Design for SSSP problems on GPUs

Zhengding Hu; Jingwen Sun; Le Jiang; Yuhao Wang; Junqing Lin; Yi Zong; Guangzhong Sun

TL;DR

MLMQ addresses the inefficiencies of a single fixed GPU queue for SSSP by distributing work across a three-level queue hierarchy that maps onto the GPU memory hierarchy. It introduces a cache-like collaboration mechanism and modular Read/Write primitives for composing diverse queue types, paired with an input-adaptive scheme that uses graph features and a learning model to select an effective configuration. Empirical results show average speedups of 1.87x to 17.13x over state-of-the-art baselines across multiple GPUs and graph types, and additional validation on BFS suggests broader applicability. The work provides an open-source framework for graph processing on GPUs, with potential extensions to other algorithms and distributed setups.

Abstract

As one of the most fundamental problems in graph processing, the Single-Source Shortest Path (SSSP) problem plays a critical role in numerous application scenarios. However, existing GPU-based solutions remain inefficient, as they typically rely on a single, fixed queue design that incurs severe synchronization overhead, high memory latency, and poor adaptivity to diverse inputs. To address these inefficiencies, we propose MultiLevelMultiQueue (MLMQ), a novel data structure that distributes multiple queues across the GPU's multi-level parallelism and memory hierarchy. To realize MLMQ, we introduce a cache-like collaboration mechanism for efficient inter-queue coordination, and develop a modular queue design based on unified Read and Write primitives. Within this framework, we expand the optimization space by designing a set of GPU-friendly queues, composing them across multiple levels, and further providing an input-adaptive MLMQ configuration scheme. Our MLMQ design achieves average speedups of 1.87x to 17.13x over state-of-the-art implementations. Our code is open-sourced at https://github.com/Leo9660/MLMQ.git.
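The abstract names two ideas that are easier to picture in code: queues built on unified Read and Write primitives, and a cache-like relationship between levels. The Python sketch below is a CPU-side illustration only, not the authors' GPU implementation. The class names, the `prefetch` parameter, and the spill-on-full and refill-on-miss policy are all assumptions made for this example; the paper's actual policies may differ.

```python
from collections import deque


class FifoQueue:
    """Hypothetical unbounded lower-level queue exposing only write/read."""

    def __init__(self):
        self._items = deque()

    def write(self, item):
        self._items.append(item)
        return True

    def read(self):
        # None signals "empty", so items themselves must not be None.
        return self._items.popleft() if self._items else None


class BoundedLocalQueue:
    """Hypothetical small upper-level queue; write reports failure when full."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity

    def write(self, item):
        if len(self._items) >= self._capacity:
            return False
        self._items.append(item)
        return True

    def read(self):
        return self._items.popleft() if self._items else None


class TwoLevelQueue:
    """Upper level sits in front of the lower level, loosely like a cache."""

    def __init__(self, upper, lower, prefetch=2):
        self.upper = upper
        self.lower = lower
        self.prefetch = prefetch  # assumed refill batch size

    def write(self, item):
        # Assumed policy: spill to the lower level when the upper one is full.
        if not self.upper.write(item):
            self.lower.write(item)

    def read(self):
        item = self.upper.read()
        if item is not None:
            return item  # "hit" in the upper level
        # "Miss": take one item from the lower level and prefetch a few more.
        first = self.lower.read()
        for _ in range(self.prefetch):
            nxt = self.lower.read()
            if nxt is None:
                break
            if not self.upper.write(nxt):
                self.lower.write(nxt)  # upper level full, put the item back
                break
        return first


q = TwoLevelQueue(BoundedLocalQueue(2), FifoQueue(), prefetch=2)
for v in [1, 2, 3, 4, 5]:
    q.write(v)
print([q.read() for _ in range(6)])
```

Because both levels expose the same two operations, either class could be swapped for another queue type without changing `TwoLevelQueue`, which is the kind of composition the modular design is meant to allow.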

Paper Structure (24 sections, 14 figures, 4 tables)

Introduction
Background and Motivation
Single-Source Shortest Path
Single Queue Performance Bottlenecks
Challenges for Multiple Queues on GPUs
MLMQ: Design and Method
Multi-level Queue Structure
Cache-like Queue Collaboration
Modular Queue with Unified Primitives
Upper-level Queue Design
Input Adaptive Configuration
Implementation
MLMQ-based Kernel Implementation
Concurrent L2 Queue Implementation
Evaluation
...and 9 more sections

Figures (14)

Figure 1: Differences between single concurrent queue and MLMQ. $T_i$ refers to the $i$-th GPU thread.
Figure 2: Relaxing and enqueue/dequeue operations in SSSP. Different queues correspond to different algorithms.
Figure 3: Performance profiling with different queues on NVIDIA 3080 Ti. Results are normalized by the number of vertices, which corresponds to the optimal work, i.e., the number of relaxations in Dijkstra’s algorithm.
Figure 4: Performance and total work comparison when using local queues on a bucket queue.
Figure 5: An example of redundant wavefronts.
...and 9 more figures
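The captions of Figures 2 and 3 rest on two points: the choice of queue determines which SSSP algorithm results, and total work can be compared against the vertex count. The sketch below illustrates both on a tiny hand-made graph. It is a sequential Python illustration under stated assumptions, not the paper's GPU kernels: `sssp`, `fifo_queue`, `priority_queue`, and `GRAPH` are hypothetical names, and "work" is counted here as the number of vertices dequeued and processed.

```python
import heapq
from collections import deque


def fifo_queue():
    """FIFO order gives a Bellman-Ford-style worklist algorithm."""
    q = deque()
    return q.append, q.popleft, lambda: not q


def priority_queue():
    """Min-distance order gives Dijkstra's algorithm."""
    h = []
    return (lambda item: heapq.heappush(h, item),
            lambda: heapq.heappop(h),
            lambda: not h)


def sssp(graph, source, make_queue):
    """Generic SSSP loop; only the queue discipline changes between runs."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    push, pop, empty = make_queue()
    push((0, source))
    work = 0  # number of vertices dequeued and actually processed
    while not empty():
        d, u = pop()
        if d > dist[u]:
            continue  # stale entry: a shorter distance was found already
        work += 1
        for v, w in graph[u]:
            if d + w < dist[v]:  # relax edge (u, v)
                dist[v] = d + w
                push((dist[v], v))
    return dist, work


# Adjacency list: vertex -> [(neighbor, weight), ...]
GRAPH = {0: [(1, 10), (2, 1)], 1: [(3, 1)], 2: [(1, 1)], 3: []}

for name, make_queue in [("fifo", fifo_queue), ("priority", priority_queue)]:
    dist, work = sssp(GRAPH, 0, make_queue)
    print(name, dist, "normalized work:", work / len(GRAPH))
```

On this graph both runs return the same distances, but the FIFO run processes 6 vertices and the priority-queue run processes 4, one per vertex, so their normalized work is 1.5 and 1.0 respectively.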
