Aggregating Funnels for Faster Fetch&Add and Queues
Younghun Roh, Yuanhao Wei, Eric Ruppert, Panagiota Fatourou, Siddhartha Jayanti, Julian Shun
TL;DR
The paper tackles contention bottlenecks in concurrent fetch-and-add (F&A) operations by introducing Aggregating Funnels. The algorithm partitions updates into batches using $2m$ Aggregator objects with $m = \left\lfloor \sqrt{p} \right\rfloor$, where $p$ is the number of threads. A single F&A on Main applies each batch, while the other threads in the batch compute their results from per-Aggregator state, reducing contention to $O(\sqrt{p})$. The authors prove strong linearizability, implement a Fetch&AddDirect fast path, provide overflow handling and memory management, and show substantial throughput gains. In microbenchmarks and in the LCRQ queue, Aggregating Funnels deliver up to 4x speedups over hardware F&A and Combining Funnels at high thread counts, with better fairness. These results demonstrate that software-based Aggregating Funnels can substantially improve the scalability of fetch-and-add-based primitives and related concurrent data structures.
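As a rough illustration of the sizing rule above, the following sketch computes $m$ and the number of Aggregators for a given thread count $p$. The function name `funnel_sizing` and the even spread of threads across Aggregators are assumptions made here for illustration; the summary does not state how threads are assigned to Aggregators.

```python
import math


def funnel_sizing(p):
    """Return (m, aggregator_count, threads_per_aggregator) for p >= 1 threads.

    m = floor(sqrt(p)) and aggregator_count = 2m follow the summary above.
    threads_per_aggregator assumes threads are spread evenly, which is a
    hypothetical assignment policy used only for this illustration.
    """
    m = math.isqrt(p)                       # floor(sqrt(p)) without float error
    aggregator_count = 2 * m
    threads_per_aggregator = math.ceil(p / aggregator_count)
    return m, aggregator_count, threads_per_aggregator


if __name__ == "__main__":
    for p in (16, 64, 100):
        print(p, funnel_sizing(p))
```

Under this assumption, both the number of Aggregators touching Main and the number of threads sharing one Aggregator grow on the order of $\sqrt{p}$ rather than $p$.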
Abstract
Many concurrent algorithms require processes to perform fetch-and-add operations on a single memory location, which can be a hot spot of contention. We present a novel algorithm called Aggregating Funnels that reduces this contention by spreading the fetch-and-add operations across multiple memory locations. It aggregates fetch-and-add operations into batches so that each batch can be performed by a single hardware fetch-and-add instruction on one location, and all operations in the batch can efficiently compute their results by performing a fetch-and-add instruction on a different location. We show experimentally that this approach achieves higher throughput than previous combining techniques, such as Combining Funnels, and is substantially more scalable than applying hardware fetch-and-add instructions on a single memory location. We show that replacing the fetch-and-add instructions in the fastest state-of-the-art concurrent queue with our Aggregating Funnels eliminates a bottleneck and greatly improves the queue's overall throughput.
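The batching idea described in the abstract can be sketched with a small single-threaded model. This is not the authors' implementation: it ignores concurrency, waiting, and the choice of which thread applies the batch, and it only shows the arithmetic that lets one fetch-and-add on the main location serve a whole batch. The names `Counter` and `apply_batch` are hypothetical.

```python
class Counter:
    """Stand-in for a memory word with fetch-and-add (not thread-safe)."""

    def __init__(self, value=0):
        self.value = value
        self.faa_calls = 0  # how many fetch-and-add operations hit this word

    def fetch_and_add(self, delta):
        old = self.value
        self.value += delta
        self.faa_calls += 1
        return old


def apply_batch(main, increments):
    """Apply a batch of increments to `main` with a single fetch-and-add.

    Each operation first does a fetch-and-add on a separate aggregator word,
    which yields its offset inside the batch. The batch total is then added
    to `main` once, and each result is the batch's base plus that offset.
    """
    aggregator = Counter()
    offsets = [aggregator.fetch_and_add(d) for d in increments]
    base = main.fetch_and_add(aggregator.value)  # one operation on the hot spot
    return [base + offset for offset in offsets]


if __name__ == "__main__":
    main = Counter(100)
    print(apply_batch(main, [1, 2, 3]))   # results seen by the three operations
    print(main.value, main.faa_calls)     # final value and accesses to main
```

In this model the three operations receive the same return values they would get from three direct fetch-and-add calls applied in order, while the main location is touched only once.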
