Engineering MultiQueues: Fast Relaxed Concurrent Priority Queues
Marvin Williams, Peter Sanders
TL;DR
The paper introduces the MultiQueue, a relaxed concurrent priority queue that spreads elements across many internal sequential priority queues (PQs) and deletes using a two-choice policy, which substantially reduces contention. With buffering, stickiness, and a cache-aware design, it achieves running times close to sequential ones while keeping rank errors and delays bounded and growing only linearly with the number of threads $p$. The theoretical analysis shows probabilistic wait-freedom and an expected rank error of $O(p)$ (with an $O(p\log p)$ tail bound) under reasonable assumptions. The empirical evaluation demonstrates higher throughput and competitive quality on benchmarks such as shortest-path and knapsack, significantly outperforming linearizable PQs and many other competitors. The approach offers a practical, configurable trade-off between throughput and quality and suggests broad applicability to parallel priority-driven workloads, including potential GPU implementations.
Abstract
Priority queues are used in a wide range of applications, including prioritized online scheduling, discrete event simulation, and greedy algorithms. In parallel settings, classical priority queues often become a severe bottleneck, resulting in low throughput. Consequently, there has been significant interest in concurrent priority queues with relaxed semantics. In this article, we present the MultiQueue, a flexible approach to relaxed priority queues that uses multiple internal sequential priority queues. The scalability of the MultiQueue is enhanced by buffering elements, batching operations on the internal queues, and optimizing access patterns for high cache locality. We investigate the complementary quality criteria of rank error, which measures how close deleted elements are to the global minimum, and delay, which quantifies how many smaller elements were deleted before a given element. Extensive experimental evaluation shows that the MultiQueue outperforms competing approaches across several benchmarks. This includes shortest-path and branch-and-bound benchmarks that resemble real applications. Moreover, the MultiQueue can be configured easily to balance throughput and quality according to the application's requirements. We employ a seemingly paradoxical technique of wait-free locking that might be of broader interest for converting sequential data structures into relaxed concurrent data structures.
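The core idea summarized above, several internal sequential priority queues combined with a two-choice deletion rule, can be illustrated with a minimal single-threaded sketch. It is not the authors' implementation: the class name MultiQueueSketch and its methods are hypothetical, insertion into a uniformly random internal queue is an assumption of this sketch, and locking, buffering, batching, and stickiness are all omitted.

```python
import heapq
import random


class MultiQueueSketch:
    """Hypothetical single-threaded sketch of a relaxed PQ built from several heaps."""

    def __init__(self, num_queues, seed=None):
        if num_queues < 2:
            raise ValueError("need at least two internal queues")
        self._queues = [[] for _ in range(num_queues)]
        self._rng = random.Random(seed)
        self._size = 0

    def __len__(self):
        return self._size

    def insert(self, key):
        # Assumption: push into one internal heap chosen uniformly at random.
        heapq.heappush(self._rng.choice(self._queues), key)
        self._size += 1

    def delete_min(self):
        if self._size == 0:
            raise IndexError("delete_min from an empty MultiQueueSketch")
        while True:
            # Two-choice rule: sample two distinct heaps and pop from the one
            # whose top element is smaller.
            first, second = self._rng.sample(self._queues, 2)
            candidates = [q for q in (first, second) if q]
            if candidates:
                best = min(candidates, key=lambda q: q[0])
                self._size -= 1
                return heapq.heappop(best)
            # Both sampled heaps were empty; resample.


def rank_errors(num_queues, num_elements, seed=0):
    """Insert 0..num_elements-1, delete everything, and record each rank error.

    The rank error of a deletion is the number of elements still in the
    structure that are smaller than the deleted element.
    """
    mq = MultiQueueSketch(num_queues, seed=seed)
    for key in range(num_elements):
        mq.insert(key)
    remaining = set(range(num_elements))
    errors = []
    while len(mq) > 0:
        key = mq.delete_min()
        remaining.remove(key)
        errors.append(sum(1 for other in remaining if other < key))
    return errors
```

With exactly two internal queues, both are inspected on every deletion, so the sketch returns the exact minimum. With more queues, the deleted element may not be the global minimum, which is the relaxation that the rank error quantifies.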
