Engineering MultiQueues: Fast Relaxed Concurrent Priority Queues

Marvin Williams, Peter Sanders

TL;DR

The paper introduces the MultiQueue, a relaxed concurrent priority queue that distributes elements across many internal sequential priority queues (PQs) and uses a two-choice deletion policy to substantially reduce contention. With buffering, stickiness, and cache-aware design, it achieves near-sequential running times while keeping rank errors and delays bounded, growing only linearly with the number of threads $p$. Theoretical analysis shows probabilistic wait-freedom and $O(p)$ expected rank error (with an $O(p\log p)$ tail bound) under reasonable assumptions; empirical evaluation demonstrates superior throughput and competitive quality on benchmarks such as shortest-path and knapsack, significantly outperforming linearizable PQs and many rivals. The approach offers a practical, configurable balance between throughput and quality and suggests broad applicability to parallel priority-driven workloads, including potential GPU implementations.

Abstract

Priority queues are used in a wide range of applications, including prioritized online scheduling, discrete event simulation, and greedy algorithms. In parallel settings, classical priority queues often become a severe bottleneck, resulting in low throughput. Consequently, there has been significant interest in concurrent priority queues with relaxed semantics. In this article, we present the MultiQueue, a flexible approach to relaxed priority queues that uses multiple internal sequential priority queues. The scalability of the MultiQueue is enhanced by buffering elements, batching operations on the internal queues, and optimizing access patterns for high cache locality. We investigate the complementary quality criteria of rank error, which measures how close deleted elements are to the global minimum, and delay, which quantifies how many smaller elements were deleted before a given element. Extensive experimental evaluation shows that the MultiQueue outperforms competing approaches across several benchmarks. This includes shortest-path and branch-and-bound benchmarks that resemble real applications. Moreover, the MultiQueue can be configured easily to balance throughput and quality according to the application's requirements. We employ a seemingly paradoxical technique of wait-free locking that might be of broader interest for converting sequential data structures into relaxed concurrent data structures.
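To make the rank-error criterion concrete, the following minimal sketch computes it from a recorded operation trace. It assumes the convention that the rank error of a deletion is the number of strictly smaller keys still in the queue at that moment, so an exact priority queue always scores 0. The function name `rank_errors` and the trace format are hypothetical and chosen only for illustration.

```python
import bisect


def rank_errors(trace):
    """Return the rank error of every deletion in `trace`.

    `trace` is a list of ("insert", key) and ("delete", key) pairs
    (a hypothetical format used only for this illustration).
    """
    present = []  # sorted list of keys currently in the queue
    errors = []
    for op, key in trace:
        if op == "insert":
            bisect.insort(present, key)
        else:
            # Number of keys strictly smaller than the deleted key.
            rank = bisect.bisect_left(present, key)
            if rank == len(present) or present[rank] != key:
                raise ValueError(f"deleted key {key!r} is not in the queue")
            errors.append(rank)
            del present[rank]  # remove one copy of the deleted key
    return errors


trace = [("insert", 5), ("insert", 1), ("insert", 3),
         ("delete", 3), ("delete", 1), ("delete", 5)]
print(rank_errors(trace))  # [1, 0, 0]
```

Deleting 3 while 1 is still present costs one rank; the two later deletions each remove the current minimum.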

Paper Structure

This paper contains 33 sections, 2 theorems, 6 equations, 22 figures, 3 tables.

Key Result

Theorem 1

The expected time for a thread to acquire a lock during the insert and delete operations is in $\mathcal{O}(1)$.
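A back-of-the-envelope argument makes the constant bound plausible. It is a sketch under simplifying assumptions that are not part of the theorem statement: there are $cp$ internal PQs for $p$ threads with queue factor $c > 1$, each thread holds at most one lock at a time, and every lock attempt picks a PQ uniformly at random, independently of which PQs are currently locked. At most $p-1$ PQs are then locked by other threads, so

$$
\Pr[\text{attempt fails}] \le \frac{p-1}{cp} < \frac{1}{c},
\qquad
\mathbb{E}[\text{attempts}] = \sum_{k \ge 0} \Pr[\text{attempts} > k] \le \sum_{k \ge 0} c^{-k} = \frac{c}{c-1}.
$$

For $c = 2$ this gives at most two attempts in expectation, independent of $p$.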

Figures (22)

  • Figure 1: Schematic view of the MultiQueue data structure with three threads, a queue factor of $c=2$ and $d=2$ deletion candidates. A gray square in front of a PQ indicates that the respective PQ is locked. Green and red arrows represent insertions and deletions, respectively. The dashed red line indicates the deletion candidate that is not deleted from.
  • Figure 2: Pseudocode for the insert and delete operations with $d=2$. The internal PQs are stored in array $A$. The min function returns a smallest element or $\bot$ if the PQ is empty.
  • Figure 3: Schematic view of the search tree $H$ of height $h$, with nodes in $H_{\leq}$ and $H_{>}$. The path from the root to the optimal solution $s$ is highlighted. The dashed lines bound the area of $H_{>}$ that is explored due to the delay.
  • Figure 4: Pseudocode for inserting into and deleting from a locked PQ $Q$ with insertion buffer $I$ and deletion buffer $D$. The buffers have capacities $C_I$ and $C_D$, respectively. The max and min operations return a largest and smallest element or $\bot$ if the set is empty, respectively. Refilling $D$ from $Q$ is done by iteratively deleting the smallest element from $Q$ and inserting it into $D$ until $D$ is full or $Q$ is empty.
  • Figure 5: Pseudocode for atomically swapping an entry $i$ with another randomly chosen entry. The operation $\texttt{atomicExchange}(A,v)$ atomically reads the value at $A$ and sets it to $v$; the operation $\texttt{compareAndSwap}(A,e,x)$ atomically compares the value at $A$ with $e$ and sets it to $x$ if they are equal.
  • ...and 17 more figures
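
The insert and delete operations described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is an illustrative Python sketch, not the paper's pseudocode: the class name `MultiQueueSketch`, the use of `heapq` and `threading.Lock`, and the choice to return `None` when both deletion candidates look empty are all assumptions. It omits the buffering, stickiness, and cache optimizations, and the unlocked peek relies on CPython's list operations being atomic, where a native implementation would read an atomically published top key instead.

```python
import heapq
import random
import threading


class MultiQueueSketch:
    """Relaxed PQ sketch: c * p binary heaps, each guarded by a try-lock."""

    def __init__(self, num_threads, c=2, seed=None):
        self._rng = random.Random(seed)
        self._heaps = [[] for _ in range(c * num_threads)]
        self._locks = [threading.Lock() for _ in self._heaps]

    def _min(self, i):
        # Unlocked peek at the smallest key of heap i; None marks "empty".
        try:
            return self._heaps[i][0]
        except IndexError:
            return None

    def insert(self, key):
        while True:
            i = self._rng.randrange(len(self._heaps))
            # Never wait for a lock: on failure, retry with a fresh random PQ.
            if self._locks[i].acquire(blocking=False):
                try:
                    heapq.heappush(self._heaps[i], key)
                finally:
                    self._locks[i].release()
                return

    def delete_min(self):
        n = len(self._heaps)
        while True:
            # d = 2 deletion candidates, chosen uniformly at random.
            i, j = self._rng.randrange(n), self._rng.randrange(n)
            a, b = self._min(i), self._min(j)
            if a is None and b is None:
                return None  # both candidates look empty; others may not be
            if a is None or (b is not None and b < a):
                i = j  # keep the candidate with the smaller minimum
            if self._locks[i].acquire(blocking=False):
                try:
                    if self._heaps[i]:  # re-check: it may have been emptied
                        return heapq.heappop(self._heaps[i])
                finally:
                    self._locks[i].release()


# Single-threaded usage: insert 100 keys, then drain the structure.
mq = MultiQueueSketch(num_threads=4, seed=1)
for k in range(100):
    mq.insert(k)
out = []
while len(out) < 100:
    x = mq.delete_min()
    if x is not None:
        out.append(x)
```

The keys in `out` come out only approximately sorted, which is the relaxation the rank-error and delay criteria quantify. A caller that needs a reliable emptiness test would have to add one, since `None` here only means that the two sampled PQs were empty.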

Theorems & Definitions (2)

  • Theorem 1
  • Corollary 1