Adaptive Asynchronous Work-Stealing for distributed load-balancing in heterogeneous systems
João B. Fernandes, Ítalo A. S. de Assis, Idalmis M. S. Martins, Tiago Barros, Samuel Xavier-de-Souza
TL;DR
The paper tackles load imbalance in heterogeneous HPC systems by introducing Adaptive Asynchronous Work-Stealing (A2WS), a decentralized, fully asynchronous scheduler that relies on limited ring-based information exchange and MPI one-sided communication. It combines information-driven stealing with a preemptive, radius-limited information propagation strategy and a head-tail asynchronous theft mechanism, redistributing tasks without extra communication threads. Key contributions include an analytic derivation of the ideal task distribution, a bounded information radius, and an asynchronous, lock-free deque protocol built on MPI atomic operations. Empirical evaluation on a seismic modeling workload shows a runtime reduction of roughly 10% at scale compared to centralized and token-based work-stealing methods, indicating improved load balancing and scalability on heterogeneous clusters.
Abstract
Supercomputers have revolutionized how industries and scientific fields process large amounts of data. These machines group hundreds or thousands of computing nodes that work together to execute time-consuming programs requiring large amounts of computational resources. Over the years, supercomputers have expanded to include new and different technologies, which makes them heterogeneous. Executing a program in a heterogeneous environment, however, requires attention to a specific source of performance degradation: load imbalance. In this research, we address the challenges associated with load imbalance when scheduling many homogeneous tasks in a heterogeneous environment. To this end, we introduce the concept of adaptive asynchronous work-stealing. This approach collects information about the nodes and uses it to improve aspects of work-stealing such as victim selection and task offloading. Additionally, the proposed approach eliminates the need for extra threads to communicate information, thereby reducing the overhead of a fully asynchronous implementation. Our experimental results demonstrate a performance improvement of approximately 10.1% compared to other conventional and state-of-the-art implementations.
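
To make the idea of information-driven victim selection concrete, the following is a minimal Python sketch and not the A2WS implementation. It assumes that the ideal share of each node is proportional to its relative speed and that a thief prefers the ring neighbor, within a given radius, holding the largest surplus over that share. The function names (`ideal_shares`, `pick_victim`), the proportional rule, and the surplus criterion are illustrative assumptions. For simplicity the sketch passes complete per-node lists, whereas a decentralized scheduler would only hold a partial view of its neighborhood.

```python
from typing import List, Optional


def ideal_shares(total_tasks: int, speeds: List[float]) -> List[int]:
    """Split total_tasks across nodes in proportion to their speed (assumed rule)."""
    total_speed = sum(speeds)
    exact = [total_tasks * s / total_speed for s in speeds]
    shares = [int(x) for x in exact]  # floor of each proportional share
    # Give the tasks lost to rounding to the nodes with the largest remainders.
    leftover = total_tasks - sum(shares)
    by_remainder = sorted(
        range(len(speeds)), key=lambda i: exact[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


def pick_victim(
    thief: int, queued: List[int], speeds: List[float], radius: int
) -> Optional[int]:
    """Return the ring neighbor within `radius` hops that has the largest
    surplus over its ideal share, or None if no neighbor has a surplus."""
    n = len(queued)
    shares = ideal_shares(sum(queued), speeds)
    best, best_surplus = None, 0
    for d in range(1, radius + 1):
        # Look d hops to the left and to the right on the ring.
        for v in ((thief - d) % n, (thief + d) % n):
            if v == thief:
                continue
            surplus = queued[v] - shares[v]
            if surplus > best_surplus:
                best, best_surplus = v, surplus
    return best
```

In this sketch a larger radius lets a thief see more candidate victims at the cost of holding more remote information, which is the trade-off a bounded information radius is meant to control.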
