Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems

Wenyi Wang; Maxime Gonthier; Poornima Nookala; Haochen Pan; Ian Foster; Ioan Raicu; Kyle Chard

TL;DR

This paper targets efficient fine-grained task parallelism on multi-socket many-core systems. It replaces GNU OpenMP's lock-based task queue with XQueue, a lock-less MPMC queue, and replaces the centralized barrier with a distributed tree barrier to reduce barrier contention. It then adds two NUMA-aware, lock-less dynamic load-balancing strategies (NA-RP and NA-WS), implemented on top of a lock-less messaging protocol. Evaluations on the Barcelona OpenMP Task Suite and a Proof-of-Space blockchain workload show speedups of up to 1522.8× over the original GNU OpenMP from XQueue and the tree barrier, up to 4× further improvement from load balancing over GNU OpenMP with XQueue, and higher throughput on the blockchain workload. The paper also gives guidelines for tuning parameters according to task size. Together, these changes reduce synchronization overhead and improve data locality, making OpenMP tasking scale better on NUMA many-core architectures.

Abstract

Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU implementation of the OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks because of the time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation that replaces GNU's priority task queue and removes the global task lock. Second, we develop a scalable, efficient, hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead of GNU's centralized barrier. Third, we develop two lock-less, NUMA-aware load balancing strategies. We evaluate our implementation using the Barcelona OpenMP Task Suite (BOTS) benchmarks. We show that XQueue and the distributed tree barrier can improve performance by up to 1522.8$\times$ compared to the original GNU OpenMP. We further show that lock-less load balancing can improve performance by up to 4$\times$ compared to GNU OpenMP using XQueue.
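
To make the idea of a lock-less task queue concrete, the following is a minimal sketch of a single-producer/single-consumer ring buffer written with C11 atomics. It illustrates the general technique of replacing a lock with ordered atomic loads and stores; it is not XQueue's implementation, and the names `spsc_ring`, `spsc_init`, `spsc_push`, `spsc_pop`, and `RING_CAPACITY` are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity for this sketch; must be a power of two. */
#define RING_CAPACITY 8

typedef struct {
    void *slots[RING_CAPACITY];
    _Atomic size_t head; /* next slot to read; written only by the consumer */
    _Atomic size_t tail; /* next slot to write; written only by the producer */
} spsc_ring;

static void spsc_init(spsc_ring *q) {
    assert(q != NULL);
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns false if the ring is full. */
static bool spsc_push(spsc_ring *q, void *task) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    /* Acquire pairs with the consumer's release store to head, so the
       slot we are about to overwrite has already been read. */
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == RING_CAPACITY)
        return false;
    q->slots[tail & (RING_CAPACITY - 1)] = task;
    /* Release publishes the slot contents before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool spsc_pop(spsc_ring *q, void **task) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    /* Acquire pairs with the producer's release store to tail. */
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *task = q->slots[head & (RING_CAPACITY - 1)];
    /* Release hands the slot back to the producer. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, neither side needs a lock or a compare-and-swap loop; correctness relies only on the acquire/release pairing. This holds only under the stated single-producer, single-consumer assumption, so a queue shared by many threads would need additional structure on top of it.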
