ASA -- The Adaptive Scheduling Algorithm
Abel Souza, Kristiaan Pelckmans, Devarshi Ghoshal, Lavanya Ramakrishnan, Johan Tordsson
TL;DR
The paper tackles prolonged queue waits in HPC batch systems for data-intensive scientific workflows by introducing ASA, an adaptive scheduling algorithm that learns queue waiting times online and proactively submits resource change requests to reduce inter-stage waiting. ASA uses a reinforcement-learning-inspired, convergence-proven framework that maintains a probability distribution over a fixed set of waiting-time alternatives and updates it as workflow stages execute. Real-world experiments across two supercomputers and three representative workflows show ASA achieving near-optimal resource utilization while reducing average workflow queue waiting times by up to 10% and makespan by up to 2%, and remaining robust under queue workload variability. The proposed Mesos-based Unified View and proactive scheduling library enable workflow management systems (WMS) to operate over a global resource pool, offering fault tolerance and elasticity while maintaining workflow ordering and QoS constraints, with promising implications for scalable, low-latency scientific data processing.
Abstract
In High Performance Computing (HPC) infrastructures, the control of resources by batch systems can lead to prolonged queue waiting times and longer overall application execution times, particularly in data-intensive and low-latency workflows where efficient processing hinges on resource planning and timely allocation. Allocating the maximum capacity upfront ensures the fastest execution but results in idle resources, extended queue waits, and costly usage. Conversely, dynamic allocation based on the requirements of each workflow stage optimizes resource usage but may lengthen the total workflow makespan, because every stage has to wait in the queue again. To address these issues, we introduce ASA, the Adaptive Scheduling Algorithm. ASA is a novel, convergence-proven scheduling technique that minimizes jobs' inter-stage waiting times by estimating queue waiting times and proactively submitting resource change requests ahead of time. It strikes a balance between exploration and exploitation, that is, between learning the waiting times and applying what it has learnt. Real-world experiments with scientific workflows at two supercomputer centers demonstrate ASA's effectiveness, achieving near-optimal resource utilization and accuracy, with up to 10% and 2% reductions in average workflow queue waiting times and makespan, respectively.
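
The idea of learning queue waiting times online and submitting requests ahead of time can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an exponential-weights update over a fixed set of candidate waiting times, and the names `WaitTimeEstimator`, `learning_rate`, and `submit_offset` are hypothetical. Sampling from the distribution provides exploration, and as probability mass concentrates on the best alternative the behaviour shifts toward exploitation.

```python
import math
import random


class WaitTimeEstimator:
    """Hypothetical sketch: a distribution over fixed waiting-time alternatives."""

    def __init__(self, alternatives, learning_rate=0.5, seed=None):
        self.alternatives = list(alternatives)  # candidate waits, in seconds
        self.learning_rate = learning_rate      # assumed tuning parameter
        self.weights = [1.0 / len(self.alternatives)] * len(self.alternatives)
        self.rng = random.Random(seed)

    def distribution(self):
        """Return the current probability of each alternative."""
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def sample(self):
        """Draw one waiting-time estimate according to the distribution."""
        return self.rng.choices(self.alternatives, weights=self.weights, k=1)[0]

    def update(self, observed_wait):
        """Down-weight each alternative by its error against the observed wait."""
        span = (max(self.alternatives) - min(self.alternatives)) or 1.0
        for i, alt in enumerate(self.alternatives):
            loss = min(1.0, abs(alt - observed_wait) / span)  # loss in [0, 1]
            self.weights[i] *= math.exp(-self.learning_rate * loss)
        total = sum(self.weights)  # renormalise to avoid numeric underflow
        self.weights = [w / total for w in self.weights]


def submit_offset(stage_runtime, estimated_wait):
    """Seconds after the current stage starts at which to submit the next
    stage's request, so that its queue wait overlaps the running stage."""
    return max(0.0, stage_runtime - estimated_wait)


if __name__ == "__main__":
    estimator = WaitTimeEstimator([60, 300, 900, 1800], seed=0)
    for observed in [280, 310, 330, 290, 305]:  # made-up observed queue waits
        estimator.update(observed)
    guess = estimator.sample()
    print("estimated wait:", guess)
    print("submit next request after:", submit_offset(1200, guess), "s")
```

In this sketch, if the estimated wait is at least as long as the current stage's runtime, the offset is zero and the next request would be submitted immediately. An estimate that is too long risks the next allocation starting before the current stage finishes, which is the trade-off between waiting time and idle resources that the abstract describes.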
