Big Atomics

Daniel Anderson; Guy E. Blelloch; Siddhartha Jayanti

TL;DR

This work addresses the absence of efficient multiword atomic primitives by introducing Big Atomics, k-word registers supporting load, store, and CAS, and presents three lock-free designs (Cached-WaitFree, Cached-Memory-Efficient, and Cached-WaitFree-Writable) that balance speed, memory usage, and progress guarantees. The authors validate the approaches empirically, showing that SeqLock excels when the machine is undersubscribed but suffers under oversubscription, while the memory-efficient, lock-free variant maintains strong performance across workloads; hardware transactional memory (HTM) offers no clear advantage in practice. They further demonstrate practical value by building CacheHash, an inlined hash table that leverages big atomics to reduce indirection and improve performance against state-of-the-art open-source tables. Overall, the results position big atomics as a viable building block for concurrent data structures and high-performance hash tables, especially in oversubscribed environments.

Abstract

In this paper, we give theoretically and practically efficient implementations of Big Atomics, i.e., $k$-word linearizable registers that support the load, store, and compare-and-swap (CAS) operations. While modern hardware supports $k = 1$ and sometimes $k = 2$ (e.g., double-width compare-and-swap in x86), our implementations support arbitrary $k$. Big Atomics are useful in many applications, including atomic manipulation of tuples, version lists, and implementing load-linked/store-conditional (LL/SC). We design fast, lock-free implementations of big atomics based on a novel fast-path-slow-path approach we develop. We then use them to develop an efficient concurrent hash table, as evidence of their utility. We experimentally validate the approach by comparing a variety of implementations of big atomics under a variety of workloads (thread counts, load/store ratios, contention, oversubscription, and number of atomics). The experiments compare two of our lock-free variants with C++ std::atomic, a lock-based version, a version using sequence locks, and an indirect version. The results show that our approach is close to the fastest under all conditions and far outperforms others under oversubscription. We also compare our big atomics based concurrent hash table to a variety of other state-of-the-art hash tables that support arbitrary length keys and values, including implementations from Intel's TBB, Facebook's Folly, libcuckoo, and a recent release from Boost. The results show that our approach of using big atomics in the design of hash tables is a promising direction.
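
To make the load/store/CAS interface concrete, here is a minimal C++ sketch of the sequence-lock baseline that the abstract lists among the compared implementations. It illustrates the general seqlock technique only and is not the authors' code: the names SeqLockBigAtomic, Value, and demo are hypothetical, words are assumed to be 64 bits, and writers block one another, so this baseline is not lock-free, unlike the paper's own designs.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a sequence-lock big atomic over K 64-bit words.
// Class and member names are illustrative; they are not the paper's API.
template <std::size_t K>
class SeqLockBigAtomic {
 public:
  using Value = std::array<std::uint64_t, K>;

  explicit SeqLockBigAtomic(const Value& init) { write(init); }

  // Optimistic read: retry until the version is even and unchanged across
  // the copy, so the K words form one consistent snapshot.
  Value load() const {
    Value out;
    for (;;) {
      const std::uint64_t before = version_.load(std::memory_order_acquire);
      if (before & 1) continue;  // a writer currently holds the lock
      for (std::size_t i = 0; i < K; ++i)
        out[i] = words_[i].load(std::memory_order_relaxed);
      std::atomic_thread_fence(std::memory_order_acquire);
      if (version_.load(std::memory_order_relaxed) == before) return out;
    }
  }

  void store(const Value& desired) {
    const std::uint64_t v = lock();
    write(desired);
    unlock(v);
  }

  // Installs `desired` and returns true only if the current value equals
  // `expected`; otherwise leaves the value unchanged and returns false.
  bool cas(const Value& expected, const Value& desired) {
    const std::uint64_t v = lock();
    bool equal = true;
    for (std::size_t i = 0; i < K; ++i)
      equal = equal && (words_[i].load(std::memory_order_relaxed) == expected[i]);
    if (equal) write(desired);
    unlock(v);
    return equal;
  }

 private:
  // Writers serialize by moving the version from even to odd.
  std::uint64_t lock() {
    for (;;) {
      std::uint64_t v = version_.load();
      if ((v & 1) == 0 && version_.compare_exchange_weak(v, v + 1)) return v;
    }
  }
  // Publishing the next even version releases the lock.
  void unlock(std::uint64_t v) { version_.store(v + 2); }
  void write(const Value& x) {
    for (std::size_t i = 0; i < K; ++i) words_[i].store(x[i]);
  }

  std::atomic<std::uint64_t> version_{0};
  std::array<std::atomic<std::uint64_t>, K> words_{};
};

// Usage sketch: a 4-word value is read and updated as a single unit.
inline void demo() {
  using Big = SeqLockBigAtomic<4>;
  const Big::Value a{1, 2, 3, 4}, b{5, 6, 7, 8};
  Big x(a);
  const bool first = x.cas(a, b);   // current value equals `a`
  const bool second = x.cas(a, b);  // current value is now `b`
  assert(first && !second);
  assert(x.load() == b);
}
```

In this sketch a reader never writes shared memory, which is cheap when every thread is running. A writer that is descheduled while the version is odd stalls every other reader and writer, which is consistent with the reported weakness of sequence locks under oversubscription.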

Paper Structure (32 sections, 3 theorems, 5 figures, 1 table)

Introduction
Experiments.
Our Contributions.
Preliminaries.
Prior and Related Work
Applications of big atomics.
Algorithms
Cached Wait-Free
The Load Operation.
The CAS Operation.
Bounds and Correctness.
Cached Memory Efficient
Uninstalling backup nodes after caching.
Re-caching until success.
Store.
...and 17 more sections

Key Result

Theorem 3.1

The big atomic object described in Algorithm alg:wait-free-load-cas has linearizable loads and CASes. Furthermore, for $n$ big atomics each of size $k$, and $p$ processes, all operations take $O(k)$ time, and the total memory usage is $2nk + O(n + p(p+k))$.
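
As a rough illustration of the memory bound, the terms can be evaluated for one assumed configuration. Here $n = 10^7$ and $p = 96$ are borrowed from the Figure 1 setup, while $k = 4$, the 8-byte word size, and the reading of the bound as a count of words are assumptions made only for this example. The constant hidden by the $O(\cdot)$ is not given here, so the second line shows only the magnitude of the argument.

```latex
\begin{align*}
2nk &= 2 \cdot 10^{7} \cdot 4 = 8 \times 10^{7} \text{ words}
      \approx 640 \text{ MB at 8 bytes per word}, \\
n + p(p + k) &= 10^{7} + 96 \cdot (96 + 4) = 10^{7} + 9600 = 10{,}009{,}600 .
\end{align*}
```

Under these assumptions the $p(p+k)$ contribution is negligible next to $n$, and the $2nk$ term, two copies of each $k$-word value, dominates the total.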

Figures (5)

Figure 1: Throughput in billions of operations per second of our big atomic implementations and our CacheHash using those big atomic implementation strategies. The machine has 96 hardware threads. Experiments are on 10 million elements, with 50% reads (load or find) and 50% updates (cas or insert/delete). $z=0$ means the distribution is uniform.
Figure 2: Throughput in billions of operations per second for various big atomic implementations across varying thread counts $(p)$, update frequencies $(u)$, contention parameters $(z)$, table sizes $(n)$, and element size measured in number of words $(k)$.
Figure 3: Throughput in billions of operations per second for our CacheHash hashtable implementations with big atomics and a separate-chaining baseline that does not use big atomics across varying thread counts $(p)$, update frequencies $(u)$, contention parameters $(z)$ and table sizes $(n)$.
Figure 4: Throughput in billions of operations per second for two of our CacheHash hashtables versus existing open-source concurrent hashtables across varying thread counts $(p)$ and contention parameters $(z)$.
Figure 5: Throughput in billions of operations per second for various big atomic implementations, including hardware transactional memory (HTM), on an older four-socket machine across varying thread counts $(p)$, contention $(z)$, update frequency $(u)$, and table size $(n)$.

Theorems & Definitions (6)

Theorem 3.1
proof : Proof Sketch
Theorem 3.2
proof : Proof Sketch
Theorem 3.3
proof : Proof sketch
