Libfork: portable continuation-stealing with stackless coroutines

Conor John Williams, James Elliott

TL;DR

This work addresses the challenge of achieving fully-strict fork-join parallelism with bounded memory in shared-memory systems without compiler modifications. It introduces Libfork, a portable C++20 stackless-coroutines-based library that maps fork-join operations to continuation-stealing tasks using segmented stacks and NUMA-aware, lock-free schedulers. Theoretical bounds are derived for memory and time scaling, and extensive benchmarks show Libfork delivering substantial speedups (up to $7.5\times$ vs TBB and $24\times$ vs OpenMP) while using far less memory, across classical SFJ tests and UTS benchmarks. The results motivate the practical value of coroutine-based continuation stealing and set the stage for further optimizations, including HALO-like heap-elision and improved resource management.

Abstract

Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time scaling and strong bounds on memory scaling. The latter is rarely achieved because continuation stealing is difficult to implement in traditional High Performance Computing (HPC) languages, where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully-portable continuation stealing and present libfork, a lock-free, fine-grained parallelism library that combines coroutines with user-space, geometric segmented stacks. We show that our approach achieves optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to OpenMP (libomp), libfork is on average 7.2x faster and consumes 10x less memory. Similarly, compared to Intel's TBB, libfork is on average 2.7x faster and consumes 6.2x less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching busy-waiting schedulers.

Paper Structure (36 sections, 4 theorems, 17 equations, 7 figures, 2 tables, 5 algorithms)

Key Result

Theorem 1

A segmented stack storing $M \ge 1$ bytes has a worst-case size of $\text{O}\left(c\right) + c\log_2{\left(M\right)} + 4M$.
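
To make the shape of this bound concrete, the sketch below models a geometric segmented stack in which each new stacklet has twice the payload capacity of the previous one. It is a simplified illustration, not libfork's implementation: the names `StackModel`, `grow_to_hold` and `model_bound` are hypothetical, $c$ is treated as a fixed per-stacklet metadata cost (the excerpt does not define $c$), the $\text{O}(c)$ term is taken to be exactly $c$, and the model ignores wasted space at stacklet boundaries and any cached stacklet.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical model of a geometric segmented stack (not libfork's code).
struct StackModel {
    std::size_t stacklets = 0;  // number of stacklets allocated
    std::size_t capacity = 0;   // total payload capacity in bytes
    std::size_t footprint = 0;  // payload capacity plus per-stacklet metadata
};

// Allocate stacklets of payload size 1, 2, 4, ... until M bytes fit.
// `c` is the assumed metadata overhead of every stacklet, in bytes.
StackModel grow_to_hold(std::size_t M, std::size_t c) {
    assert(M >= 1);
    StackModel s;
    std::size_t next = 1;  // payload capacity of the next stacklet
    while (s.capacity < M) {
        s.capacity += next;
        s.footprint += next + c;
        ++s.stacklets;
        next *= 2;  // geometric growth
    }
    return s;
}

// The bound from Theorem 1 with the O(c) term assumed to be exactly c.
double model_bound(std::size_t M, std::size_t c) {
    const double m = static_cast<double>(M);
    const double cc = static_cast<double>(c);
    return cc + cc * std::log2(m) + 4.0 * m;
}
```

In this model, `grow_to_hold(M, c).footprint` stays below `model_bound(M, c)` because doubling keeps the number of stacklets at most $\log_2(M) + 1$ and the total payload capacity below $2M$; the remaining slack in the $4M$ term is what a real implementation would spend on fragmentation and caching.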

Figures (7)

  • Figure 1: The DAG representing the execution of the Fibonacci function (the paper's algorithm labelled alg::fib) with argument $n=3$. Diamond-shaped nodes represent leaf tasks/function calls, while circular nodes represent non-leaf tasks, each with a matched join node. Edges represent dependencies, i.e. parent-child relationships.
  • Figure 2: A diagram of a work-stealing queue. The queue contains handles to tasks at one moment during a depth-first traversal of the DAG in Figure 1.
  • Figure 3: A diagram of a cactus stack, sometimes called a spaghetti stack, which is an example of a parent pointer tree. Boxes represent (stack) frames. Arrows denote a parent-child relationship. The dotted lines represent regions that could be contiguous segments of memory, i.e. the first child of each parent can be placed on the parent's (linear) stack.
  • Figure 4: Diagram (not to scale) of a segmented stack in libfork. The metadata region is filled in gray, hatched regions indicate allocated space, double-ended arrows indicate doubly-linked list connections, and single-ended arrows represent each stacklet's stack pointer. This stack is composed of three stacklets. The middle stacklet is the top stacklet, i.e. it contains the last allocation. The rightmost stacklet is a cached stacklet; each stack contains zero or one cached stacklets.
  • Figure 5: Classical benchmarks. Here Busy-LF and Lazy-LF refer to libfork's busy and lazy schedulers respectively.
  • ...and 2 more figures
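
The work-stealing queue of Figure 2 can be illustrated with a small sketch. In the usual work-stealing arrangement, the owning worker pushes and pops task handles at one end of the queue while thieves take from the opposite end, so that during a depth-first traversal a thief receives the oldest handle, the one nearest the root of the DAG. The code below shows only that ordering. It is a single-threaded sketch under stated assumptions: `WorkQueue` and `TaskHandle` are hypothetical names, handles are plain integers, and all synchronisation is deliberately omitted, whereas the paper describes libfork's schedulers as lock-free.

```cpp
#include <cstddef>
#include <deque>
#include <optional>

// Hypothetical handle type: an integer id standing in for a task handle.
using TaskHandle = int;

// Single-threaded illustration of work-stealing queue ordering.
// No synchronisation: a real queue must handle concurrent thieves.
class WorkQueue {
public:
    // Owner: push a handle onto the bottom of the queue.
    void push(TaskHandle h) { handles_.push_back(h); }

    // Owner: pop the most recently pushed handle (LIFO, depth-first).
    std::optional<TaskHandle> pop() {
        if (handles_.empty()) return std::nullopt;
        TaskHandle h = handles_.back();
        handles_.pop_back();
        return h;
    }

    // Thief: steal the oldest handle from the top of the queue (FIFO).
    std::optional<TaskHandle> steal() {
        if (handles_.empty()) return std::nullopt;
        TaskHandle h = handles_.front();
        handles_.pop_front();
        return h;
    }

    std::size_t size() const { return handles_.size(); }

private:
    std::deque<TaskHandle> handles_;
};
```

For example, after pushing handles 1, 2 and 3 in that order, `steal()` yields 1 while `pop()` yields 3, so the owner continues with the deepest work and a thief takes the shallowest.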

Theorems & Definitions (10)

  • Definition 1
  • Theorem 1: Segmented stack overhead
  • proof
  • Lemma 1
  • proof
  • Definition 2
  • Lemma 2
  • proof
  • Theorem 2: Stack memory bound
  • proof