BlockFIFO & MultiFIFO: Scalable Relaxed Queues

Stefan Koch, Peter Sanders, Marvin Williams

TL;DR

This work introduces two scalable relaxed concurrent FIFO queues, MultiFIFO and BlockFIFO, to overcome contention in strict FIFO queues. MultiFIFO adapts the MultiQueue design by using internal ring buffers with insertion timestamps, achieving constant-time operations and rank error linear in the number of threads $p$. BlockFIFO builds a lock-free structure from blocks and push/pop windows, enabling high throughput at the cost of larger rank errors, with practical enhancements like a bitset and lookahead windows to improve performance. Extensive evaluations across micro-benchmarks and BFS on diverse architectures demonstrate order-of-magnitude throughput gains over prior relaxed and strict queues, highlighting the practical impact for parallel graph processing and other throughput-centric workloads.
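The MultiFIFO idea summarized above can be illustrated with a small sequential toy model. This is only a sketch under stated assumptions, not the authors' implementation: the class name `MultiFifoSketch`, the logical counter used as a timestamp source, the two-choice selection on pop (borrowed from the general MultiQueue idea), and the fallback scans are all hypothetical choices made for illustration. The sketch is not thread-safe and uses a bounded `deque` as a stand-in for a ring buffer.

```python
import random
from collections import deque
from itertools import count


class MultiFifoSketch:
    """Sequential toy model of a MultiQueue-style relaxed FIFO (not thread-safe)."""

    def __init__(self, num_queues, capacity, seed=0):
        # Assumption: at least two sub-queues, each bounded by `capacity`.
        # A deque stands in for the fixed-size ring buffer of each sub-queue.
        self.queues = [deque() for _ in range(num_queues)]
        self.capacity = capacity
        self.clock = count()  # hypothetical timestamp source: a logical counter
        self.rng = random.Random(seed)

    def push(self, item):
        # Start at a random sub-queue and take the first one with free space.
        n = len(self.queues)
        start = self.rng.randrange(n)
        for offset in range(n):
            q = self.queues[(start + offset) % n]
            if len(q) < self.capacity:
                q.append((next(self.clock), item))  # store the insertion timestamp
                return True
        return False  # every sub-queue is full

    def pop(self):
        # Two-choice selection: sample two distinct sub-queues and pop from
        # the one whose head element carries the older timestamp.
        i, j = self.rng.sample(range(len(self.queues)), 2)
        candidates = [q for q in (self.queues[i], self.queues[j]) if q]
        if not candidates:
            # Assumed fallback: scan all sub-queues if both samples were empty.
            candidates = [q for q in self.queues if q]
            if not candidates:
                return None
        oldest = min(candidates, key=lambda q: q[0][0])
        return oldest.popleft()[1]
```

Each sub-queue stays strictly FIFO on its own; the relaxation comes only from which sub-queue a push or pop happens to select, so an element may be returned before an older element that sits in a sub-queue that was not sampled.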

Abstract

FIFO queues are a fundamental data structure used in a wide range of applications. Concurrent FIFO queues allow multiple execution threads to access the queue simultaneously. Maintaining strict FIFO semantics in concurrent queues leads to low throughput due to high contention at the head and tail of the queue. By relaxing the FIFO semantics to allow some reordering of elements, it becomes possible to achieve much higher scalability. This work presents two orthogonal designs for relaxed concurrent FIFO queues, one derived from the MultiQueue and the other based on ring buffers. We evaluate both designs extensively on various micro-benchmarks and a breadth-first search application on large graphs. Both designs outperform state-of-the-art relaxed and strict FIFO queues, achieving higher throughput and better scalability.
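To make "some reordering of elements" concrete, the sketch below measures rank errors for a given pop order. It assumes the common definition from the relaxed-queue literature, where the rank error of a pop is the number of older elements still in the queue at that moment; the abstract itself does not spell this out. The function name `rank_errors` and the simplification that all elements are pushed before any is popped are assumptions made for illustration.

```python
from bisect import bisect_left


def rank_errors(pop_order):
    """Return the rank error of each pop.

    Assumptions: elements are labeled by their distinct push positions
    (0, 1, 2, ...), and all of them were pushed before the first pop.
    """
    remaining = sorted(pop_order)  # elements still in the queue, oldest first
    errors = []
    for item in pop_order:
        idx = bisect_left(remaining, item)
        errors.append(idx)  # idx older elements were skipped by this pop
        remaining.pop(idx)
    return errors
```

Under these assumptions a strict FIFO queue yields a rank error of zero for every pop, while a relaxed queue trades a bounded amount of such error for lower contention.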

Paper Structure

This paper contains 20 sections, 15 figures, 3 tables.

Figures (15)

  • Figure 5.1: Schematic diagram of the BlockFIFO with block size $C=4$ and window size $w=3$, represented as a linear array. The colored area represents the "active" part of the array. Slots with stronger colors contain elements. The line pattern indicates that a slot is emptied and not used again.
  • Figure 6.1: Various configurations of all competitors on the push-pop benchmark with different thread counts on machine AMD. For the BlockFIFO, the block factor $B$ ranges from 1 to 16 and the block size $C$ ranges from 7 to 2047. For the MultiFIFO, the number of sub-queues per thread $c$ ranges from 2 to 8 and the stickiness $s$ ranges from 1 to 4096. For the $k$-FIFO, the segment size $k$ ranges from $\tfrac{1}{8}p$ to $64p$. For the $d$-CBO, the sub-queue count $c$ ranges from $\tfrac{1}{8}p$ to $8p$. For all ranges, we sample integer powers of two (minus one for the block size). Non-Pareto-optimal configurations of the BlockFIFO and MultiFIFO are shown with low opacity. Configurations used in further experiments are highlighted with circles and annotated with their names.
  • Figure 6.2: Throughput on the push-pop benchmark with different thread counts on all machines.
  • Figure 6.3: Throughput with different producer-consumer ratios at different thread counts.
  • Figure 6.4: Weak and strong scaling BFS benchmarks. Only the best-performing queue configurations for each competitor are shown for clarity. The upper part shows strong scaling behaviour on real-world graphs, where the dotted lines represent the execution time of a sequential BFS. The bottom part shows weak scaling behaviour, where the graph size scales with the number of threads. The black line represents the time required by a sequential BFS on the scaled graph.
  • ...and 10 more figures