Minimize Your Critical Path with Combine-and-Exchange Locks

Simon König, Lukas Epple, Christian Becker

TL;DR

The paper addresses throughput bottlenecks in cooperatively scheduled runtimes by showing that existing synchronization either keeps work serial (inline scheduling) or introduces queuing delays (dispatch scheduling), both harming the critical path. It proposes Combine-and-Exchange Scheduling (CES), which keeps contended critical sections on the same thread while migrating non-critical work elsewhere, thereby shortening the critical path and preserving parallelism. The authors provide a CES-based design for a task-aware mutex, extend it to semaphores, reader-writer locks, and condition variables, and implement a CES-based coroutine library with C++20 coroutines and prototypes for Java Virtual Threads and boost::fibers. Extensive evaluation demonstrates up to 8.1x microbenchmark and 3.3x application throughput improvements on NUMA hardware, with improved cache locality and general applicability across languages, making CES a practical, language-agnostic enhancement for cooperative scheduling frameworks.

Abstract

Coroutines are experiencing a renaissance as many modern programming languages support the use of cooperative multitasking for highly parallel or asynchronous applications. One of the greatest advantages of this is that concurrency and synchronization are managed entirely in userspace, omitting heavy-weight system calls. However, we find that state-of-the-art userspace synchronization primitives approach synchronization in userspace from the perspective of kernel-level scheduling. This introduces unnecessary delays on the critical path of the application, limiting throughput. In this paper, we re-think synchronization for tasks that are scheduled entirely in userspace (e.g., coroutines, fibers, etc.). We develop Combine-and-Exchange Scheduling (CES), a novel scheduling approach that ensures contended critical sections stay on the same thread of execution while parallelizable work is evenly spread across the remaining threads. We show that our approach can be applied to many existing languages and libraries, resulting in 3-fold performance improvements in application benchmarks as well as 8-fold performance improvements in microbenchmarks.

Paper Structure

This paper contains 33 sections, 15 figures.

Figures (15)

  • Figure 1: Pseudo code of the lock method of a task-aware mutex. For simplicity, this pseudo code assumes the entire method call is atomic. Implementations use lock-free or locking methods to achieve this.
  • Figure 2: Pseudo code of the unlock method of an inline scheduling mutex. Note that the unlocking task does not suspend and, therefore, cannot resume on another thread! The current task's call to unlock returns only after the waiter suspends from the nested resume call.
  • Figure 3: Pseudo code of the unlock method of a dispatch scheduling mutex. The current task's call to unlock returns immediately. However, the next waiter waits in the ready queue of the executor while no other task can enter the critical section (the mutex state is LOCKED).
  • Figure 4: Delays incurred by dispatch schedulers at critical section boundaries. Patterned regions are critical sections, non-patterned regions are parallelizable.
  • Figure 5: Three tasks on three threads of execution. Each color represents one task, patterned regions indicate that the task is currently inside the critical section.
  • ...and 10 more figures
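
The captions of Figures 1 to 3 describe two unlock strategies that can be contrasted in a small, single-threaded sketch. This is not the paper's pseudo code and it does not implement CES. It is a toy model under stated assumptions: a suspended task is represented by a plain callback, and the names `Continuation`, `Executor`, `ToyMutex`, `Policy`, and `trace` are all hypothetical. It only shows the ordering difference the captions mention: with inline scheduling, `unlock` returns after the resumed waiter has finished its turn, while with dispatch scheduling `unlock` returns immediately and the waiter sits in the ready queue with the mutex still locked.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a suspended task: a callback that continues it.
using Continuation = std::function<void()>;

// Toy single-threaded executor with a FIFO ready queue (assumption).
struct Executor {
    std::deque<Continuation> ready;
    void post(Continuation c) { ready.push_back(std::move(c)); }
    void run() {
        while (!ready.empty()) {
            Continuation c = std::move(ready.front());
            ready.pop_front();
            c();
        }
    }
};

enum class Policy { Inline, Dispatch };

struct ToyMutex {
    Policy policy;
    Executor& exec;
    bool locked = false;
    std::deque<Continuation> waiters;

    // Returns true if the lock was free and is now held by the caller.
    // Otherwise the continuation is parked as a waiter ("suspend").
    bool lock(Continuation on_acquired) {
        if (!locked) { locked = true; return true; }
        waiters.push_back(std::move(on_acquired));
        return false;
    }

    void unlock() {
        assert(locked);
        if (waiters.empty()) { locked = false; return; }
        Continuation next = std::move(waiters.front());
        waiters.pop_front();
        // Ownership is handed to the waiter, so `locked` stays true.
        if (policy == Policy::Inline) {
            next();                      // nested resume: runs before unlock returns
        } else {
            exec.post(std::move(next));  // waiter queues up; unlock returns at once
        }
    }
};

// Records the order of events for two tasks, A and B, contending on one mutex.
std::vector<std::string> trace(Policy p) {
    std::vector<std::string> log;
    Executor exec;
    ToyMutex m{p, exec};
    m.lock([] {});  // task A takes the uncontended lock
    m.lock([&] {    // task B finds it locked and is parked
        log.push_back("B in critical section");
        m.unlock();
    });
    log.push_back("A calls unlock");
    m.unlock();
    log.push_back("A's unlock returned");
    exec.run();     // drains the ready queue (only non-empty under Dispatch)
    return log;
}
```

Calling `trace(Policy::Inline)` records B's critical section between A's two log entries, whereas `trace(Policy::Dispatch)` records it only after A's `unlock` has returned and the executor has drained its queue. In a real multi-threaded runtime that queueing time is spent while no task can enter the critical section.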