Are Your Epochs Too Epic? Batch Free Can Be Harmful

Daewoo Kim; Trevor Brown; Ajay Singh

TL;DR

This paper identifies a subtle but impactful interaction between epoch-based memory reclamation (EBR) and modern allocators: freeing large batches of objects at once incurs remote batch free (RBF) overhead and severe contention on multi-socket systems. It introduces amortized freeing (AF) to spread large frees over time, demonstrates its effectiveness across ten safe memory reclamation (SMR) algorithms, and develops a simple Amortized-free Token-EBR variant that outperforms state-of-the-art reclamation approaches on a 192-thread system. The authors also introduce timeline graphs that visualize thread activity and reclamation latency, making bottlenecks easier to diagnose. The results show speedups of up to 2.6x over the fastest existing reclamation method and substantial improvements for commonly used allocators such as JEmalloc and TCmalloc, while MImalloc is largely immune to the problem. Overall, the work provides practical techniques for improving memory reclamation performance in highly concurrent environments and offers new insights into allocator–reclamation interactions that can guide future design.

Abstract

Epoch-based memory reclamation (EBR) is one of the most popular techniques for reclaiming memory in lock-free and optimistic locking data structures, due to its ease of use and good performance in practice. However, EBR is known to be sensitive to thread delays, which can result in performance degradation. Moreover, the exact mechanism for this performance degradation is not well understood. This paper illustrates this performance degradation in a popular data structure benchmark, and does a deep dive to uncover its root cause: a subtle interaction between EBR and state-of-the-art memory allocators. In essence, modern allocators attempt to reduce the overhead of freeing by maintaining bounded thread caches of objects for local reuse, actually freeing them (a very high latency operation) only when thread caches become too large. EBR immediately bypasses these mechanisms whenever a particularly large batch of objects is freed, substantially increasing overheads and latencies. Beyond EBR, many memory reclamation algorithms and data structures that reclaim objects in large batches suffer similar deleterious interactions with popular allocators. We propose a simple algorithmic fix for such algorithms to amortize the freeing of large object batches over time, and apply this technique to ten existing memory reclamation algorithms, observing performance improvements for nine out of ten, and over 50% improvement for six out of ten, in experiments on a high-performance lock-free ABtree. We also present an extremely simple token-passing variant of EBR and show that, with our fix, it performs 1.5-2.6x faster than the fastest known memory reclamation algorithm, and 1.2-1.5x faster than not reclaiming at all, on a 192-thread, four-socket Intel system.
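The difference between batch free and amortized free can be illustrated with a small single-threaded model. This is a minimal sketch, not the authors' implementation: the class name `AmortizedFreeBag`, the `frees_per_op` quota, and the `free_fn` callback (a stand-in for the allocator's free) are all hypothetical, and the model assumes that every retired object becomes safe to free as soon as the epoch advances.

```python
from collections import deque


class AmortizedFreeBag:
    """Toy per-thread limbo bag; `free_fn` stands in for the allocator's free()."""

    def __init__(self, free_fn, frees_per_op=1):
        self.free_fn = free_fn
        self.frees_per_op = frees_per_op  # hypothetical tuning knob
        self.limbo = []                   # retired, not yet safe to free
        self.freeable = deque()           # safe to free, drained gradually

    def retire(self, obj):
        """Record an object that was unlinked from the data structure."""
        self.limbo.append(obj)

    def on_epoch_advance_batch(self):
        """Batch free: release the whole bag in one burst of free calls."""
        batch, self.limbo = self.limbo, []
        for obj in batch:
            self.free_fn(obj)
        return len(batch)

    def on_epoch_advance_amortized(self):
        """Amortized free: mark the bag as safe, but free nothing yet."""
        self.freeable.extend(self.limbo)
        self.limbo = []

    def after_operation(self):
        """Call once per data structure operation; frees at most the quota."""
        n = min(self.frees_per_op, len(self.freeable))
        for _ in range(n):
            self.free_fn(self.freeable.popleft())
        return n


# Usage: five retired objects are freed over several operations, not at once.
freed = []
bag = AmortizedFreeBag(freed.append, frees_per_op=2)
for i in range(5):
    bag.retire(i)
bag.on_epoch_advance_amortized()
burst_sizes = [bag.after_operation() for _ in range(4)]
print(burst_sizes)  # [2, 2, 1, 0]
```

In this model the batch path issues as many consecutive free calls as there are retired objects, which is the pattern the abstract describes as overflowing the allocator's thread caches. The amortized path caps the number of frees per operation, so the same objects are released in small bursts.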

Paper Structure (41 sections, 29 figures, 4 tables)

Introduction
Background
Diagnosing the RBF Problem
Experimental Methodology
System
Symptom: poor scaling on NUMA
Hypothesis: memory reclamation is a bottleneck
Investigating the Reclamation Bottleneck
Timeline Graphs
Comparing Timelines: 96 vs 192 Threads
Root Cause of Long Reclamation Events
A Simple Solution: Amortized Free
Effect on the Overhead of Reclamation
Effect on Unreclaimed Garbage
These Results Generalize to TCmalloc
...and 26 more sections

Figures (29)

Figure 1: Performance (operations per second) and peak memory usage with JEmalloc, for OCCtree and ABtree, using DEBRA (upper) vs leaking memory (lower).
Figure 2: Timeline graphs showing how much time threads spend freeing batches of nodes as epochs change with JEmalloc. (Y-axis = thread ID, blue dot = epoch change, space between boxes = time spent accessing the data structure.)
Figure 3: Timeline graphs showing how long individual free calls take for batch free vs amortized free. 192 threads.
Figure 4: Comparing the number of garbage nodes in each epoch for batch free (upper) and amortized free (lower) reveals the latter has a smoothing effect on memory usage.
Figure 5: Performance and peak memory usage with JEmalloc, for ABtree, using Naive Token-EBR.
...and 24 more figures
