Adaptive Asynchronous Work-Stealing for distributed load-balancing in heterogeneous systems

João B. Fernandes, Ítalo A. S. de Assis, Idalmis M. S. Martins, Tiago Barros, Samuel Xavier-de-Souza

TL;DR

The paper tackles load imbalance in heterogeneous HPC systems by introducing Adaptive Asynchronous Work-Stealing (A2WS), a decentralized, fully asynchronous scheduler that relies on limited ring-based information exchange and MPI one-sided communication. It combines smart stealing with a preemptive, radius-limited information propagation strategy and a head-tail asynchronous theft mechanism to redistribute tasks without extra communication threads. Key contributions include the analytic derivation of ideal task distribution, a bounded information radius, and an asynchronous, lock-free deque protocol using MPI atomic operations. Empirical evaluation on a seismic modeling workload shows up to about a 10% reduction in runtime at scale compared to centralized and token-based work-stealing methods, demonstrating improved load balancing and scalability for heterogeneous clusters.
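As a rough illustration of what a radius-limited ring neighborhood looks like, the sketch below lists the ranks within $R$ hops of a process on a ring. It is not the paper's propagation protocol: the function name `ring_neighbors` and the assumption that the neighborhood is symmetric (left and right) are hypothetical choices made for this example.

```python
def ring_neighbors(rank: int, size: int, radius: int) -> list[int]:
    """Ranks at ring distance 1..radius from `rank` (hypothetical helper).

    Assumes processes 0..size-1 are arranged on a ring and that the
    neighborhood extends `radius` hops in both directions.
    """
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("need size > 0 and 0 <= rank < size")
    found = []
    for d in range(1, radius + 1):
        # One hop set to the left and one to the right, wrapping around.
        for r in ((rank - d) % size, (rank + d) % size):
            # Skip the process itself and ranks already reached
            # (both happen when the radius wraps around a small ring).
            if r != rank and r not in found:
                found.append(r)
    return sorted(found)
```

With $8$ processes and $R = 2$, process $3$ would have neighborhood `[1, 2, 4, 5]` under these assumptions, so the amount of information each process tracks depends on $R$ rather than on the total number of processes.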

Abstract

Supercomputers have revolutionized how industries and scientific fields process large amounts of data. These machines group hundreds or thousands of computing nodes working together to execute time-consuming programs that require a large amount of computational resources. Over the years, supercomputers have expanded to include new and different technologies characterizing them as heterogeneous. However, executing a program in a heterogeneous environment requires attention to a specific aspect of performance degradation: load imbalance. In this research, we address the challenges associated with load imbalance when scheduling many homogeneous tasks in a heterogeneous environment. To address this issue, we introduce the concept of adaptive asynchronous work-stealing. This approach collects information about the nodes and utilizes it to improve work-stealing aspects, such as victim selection and task offloading. Additionally, the proposed approach eliminates the need for extra threads to communicate information, thereby reducing overhead when implementing a fully asynchronous approach. Our experimental results demonstrate a performance improvement of approximately 10.1% compared to other conventional and state-of-the-art implementations.

Paper Structure (13 sections, 13 equations, 5 figures, 5 tables, 1 algorithm)

This paper contains 13 sections, 13 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Diagram exemplifying the information communication of $3$ processes ($p_2$, $p_3$, $p_4$) in a system with $8$ processors and $R = 2$. The arrows represent MPI Put operations. The diagram shows all possible operations on the $p_3$ information vector.
  • Figure 2: Diagrams of the get task and steal task A2WS operations, showing the order of the MPI lock and unlock steps on the head, tail, and deque. The get task subfigure (\ref{subfig:get_task}) shows the operation from the perspective of the task deque owner in steps I to IV, whereas the steal task subfigure (\ref{subfig:steal_task}) shows the thief's operation in steps I to V.
  • Figure 3: Sequences of operations performed in the stealing step to obtain information from the victim deque and update the victim tail. Two versions are shown: the raw version (\ref{subfig:ht old}), with head and tail as separate data, as commonly used in work-stealing with MPI, and the adapted version (\ref{subfig:ht new}), with head and tail as a proposed single data structure. Bold arrows represent long operations, which involve data transfer; short operations do not. Dashed arrows are occasional operations that occur only when the local and victim information disagree and the steal must be adjusted.
  • Figure 4: Runtime (in minutes) of $5$ test samples, varying the A2WS radius from $R1$ ($R = 1$) to $R32$ ($R = 32$), using Configuration $04$ of Table \ref{tab:info_flag} and $3840$ tasks. The subplot shows the subset of the data from $R8$ to $R32$.
  • Figure 5: Runtime (in minutes) per task of A2WS, CTWS, and LW using Configuration $1$ of Table \ref{tab:tests_config} and $480$ tasks. The number of tasks executed is shown at the end of each line.
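The single head-tail data structure mentioned in the Figure 3 caption can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's implementation: the 32-bit/32-bit layout, the convention that pending tasks occupy indices `[head, tail)`, and the helper names `pack`, `unpack`, and `steal_from_tail` are all hypothetical. The point is that once both indices live in one 64-bit word, a single read of that word yields a consistent pair, which is the kind of value a one-sided atomic operation on one word could fetch or replace.

```python
MASK32 = 0xFFFFFFFF  # assumed layout: head in the high 32 bits, tail in the low 32


def pack(head: int, tail: int) -> int:
    """Combine head and tail into one 64-bit word (hypothetical layout)."""
    if not (0 <= head <= MASK32 and 0 <= tail <= MASK32):
        raise ValueError("head and tail must each fit in 32 bits")
    return (head << 32) | tail


def unpack(word: int) -> tuple[int, int]:
    """Split a packed word back into (head, tail)."""
    return word >> 32, word & MASK32


def steal_from_tail(word: int, want: int) -> tuple[int, range]:
    """Compute the thief's update: the new packed word and the stolen indices.

    Assumes pending tasks are the indices [head, tail) and that the thief
    takes from the tail end while the owner consumes from the head.
    """
    head, tail = unpack(word)
    take = min(want, max(tail - head, 0))  # never steal past the head
    new_tail = tail - take
    return pack(head, new_tail), range(new_tail, tail)
```

For example, with `head = 3` and `tail = 10`, a thief asking for 4 tasks would obtain indices 6 to 9 and a new word encoding `(3, 6)`. In a real MPI setting the new word would have to be installed atomically on the victim, for instance with a compare-and-swap on the packed value, and retried if the word changed in the meantime; that step is omitted here.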