Minimize Your Critical Path with Combine-and-Exchange Locks
Simon König, Lukas Epple, Christian Becker
TL;DR
The paper addresses throughput bottlenecks in cooperatively scheduled runtimes by showing that existing synchronization either keeps work serial (inline scheduling) or introduces queuing delays (dispatch scheduling), both harming the critical path. It proposes Combine-and-Exchange Scheduling (CES), which keeps contended critical sections on the same thread while migrating non-critical work elsewhere, thereby shortening the critical path and preserving parallelism. The authors provide a CES-based design for a task-aware mutex, extend it to semaphores, reader-writer locks, and condition variables, and implement a CES-based coroutine library with C++20 coroutines and prototypes for Java Virtual Threads and boost::fibers. Extensive evaluation demonstrates up to 8.1x microbenchmark and 3.3x application throughput improvements on NUMA hardware, with improved cache locality and general applicability across languages, making CES a practical, language-agnostic enhancement for cooperative scheduling frameworks.
Abstract
Coroutines are experiencing a renaissance as many modern programming languages support cooperative multitasking for highly parallel or asynchronous applications. One of its greatest advantages is that concurrency and synchronization are managed entirely in userspace, avoiding heavyweight system calls. However, we find that state-of-the-art userspace synchronization primitives approach synchronization from the perspective of kernel-level scheduling. This introduces unnecessary delays on the critical path of the application, limiting throughput. In this paper, we rethink synchronization for tasks that are scheduled entirely in userspace (e.g., coroutines and fibers). We develop Combine-and-Exchange Scheduling (CES), a novel scheduling approach that ensures contended critical sections stay on the same thread of execution while parallelizable work is evenly spread across the remaining threads. We show that our approach can be applied to many existing languages and libraries, resulting in 3-fold performance improvements in application benchmarks as well as 8-fold performance improvements in microbenchmarks.
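To make the scheduling idea concrete, the following is a minimal, single-threaded toy simulation in Python. It is not the authors' implementation. The handoff rule it uses is an assumption based only on the description above: on unlock, the next waiter's critical section runs on the releasing worker, and the releasing task's remaining work moves to the worker the waiter came from. All names (`make_task`, `run_ces`, the state labels) are hypothetical.

```python
from collections import deque


def make_task(name, crit_log, rest_log):
    """A task as a generator: lock, critical section, unlock, other work."""
    def body():
        worker = yield "lock"            # resumed once the lock is granted
        crit_log.append((name, worker))  # critical section runs on `worker`
        worker = yield "unlock"          # resumed after the lock is released
        rest_log.append((name, worker))  # non-critical work runs on `worker`
    return body()


def run_ces(num_workers, num_tasks):
    """Simulate workers in rounds; each worker runs one task step per round."""
    crit_log, rest_log = [], []
    queues = [deque() for _ in range(num_workers)]
    for i in range(num_tasks):
        task = make_task(f"t{i}", crit_log, rest_log)
        queues[i % num_workers].append((task, "new"))

    held = False        # state of the single simulated mutex
    waiters = deque()   # (task, worker the task was waiting on)

    while any(queues):
        for w, queue in enumerate(queues):
            if not queue:
                continue
            task, state = queue.popleft()
            if state == "new":
                next(task)  # run up to the lock request
                if not held:
                    held = True
                    queue.appendleft((task, "granted"))
                else:
                    waiters.append((task, w))  # suspend on the mutex
            elif state == "granted":
                task.send(w)  # run the critical section up to the unlock
                if waiters:
                    # Assumed handoff rule: the next critical section stays
                    # on this worker; the unlocker's continuation migrates
                    # to the worker the waiter came from.
                    nxt, home = waiters.popleft()
                    queue.appendleft((nxt, "granted"))
                    queues[home].append((task, "released"))
                else:
                    held = False
                    queue.appendleft((task, "released"))
            else:  # "released": run the remaining non-critical work
                try:
                    task.send(w)
                except StopIteration:
                    pass
    return crit_log, rest_log


if __name__ == "__main__":
    crit, rest = run_ces(num_workers=3, num_tasks=6)
    print("critical sections:", crit)
    print("non-critical work:", rest)
```

In this small configuration every critical section is logged on a single worker, while the non-critical work is spread over several workers. The simulation says nothing about real timing or cache behavior.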
