A HPC Co-Scheduler with Reinforcement Learning

Abel Souza, Kristiaan Pelckmans, Johan Tordsson

TL;DR

This work addresses the persistent underutilization and long makespans of traditional HPC batch schedulers by introducing ASA_X, a co-scheduler that merges adaptive reinforcement learning with OS-level resource control. The method uses a forest of decision-tree experts to map cluster and application states to co-allocation actions, continually refining decisions through a reward-driven feedback loop. Evaluated on a real Numascale-based cluster with four diverse workloads, ASA_X substantially improves utilization (up to 51%) and reduces queue makespans (up to 55%) while incurring modest runtime overhead (≈10%). The approach demonstrates convergence guarantees for excess risk and shows practical potential to enhance HPC datacenter throughput through application-aware co-allocation integrated with Slurm and Mesos.

Abstract

Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to configure batch schedulers. This task is challenging and increasingly complex due to ever larger cluster scales and the heterogeneity of modern scientific workflows. As a result, HPC systems achieve low utilization with long job completion times (makespans). To tackle these challenges, we propose a co-scheduling algorithm based on adaptive reinforcement learning, where application profiling is combined with cluster monitoring. The resulting cluster scheduler matches resource utilization to application performance in a fine-grained manner (i.e., at the operating system level). As opposed to relying on nominal allocations, we apply decision trees to model applications' actual resource usage; these models are used to estimate how much resource capacity from one allocation can be co-allocated to additional applications. Our algorithm learns from incorrect co-scheduling decisions, adapts to changing environment conditions, and evaluates when such changes cause resource contention that impacts quality of service metrics such as job slowdowns. We integrate our algorithm in an HPC resource manager that combines Slurm and Mesos for job scheduling and co-allocation, respectively. Our experimental evaluation, performed on a dedicated cluster executing a mix of four different real scientific workflows, demonstrates improvements in cluster utilization of up to 51% even in high load scenarios, with 55% average queue makespan reductions under low loads.

Paper Structure

This paper contains 21 sections, 1 theorem, 11 equations, 3 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Let $\{\gamma_t>0\}_t$ be a non-increasing sequence. The excess risk $E_t$ after $t$ rounds is then bounded by

Figures (3)

  • Figure 1: (a) Batch and (b) ASA$_X$ architectures. (a) A traditional batch system such as Slurm. The upside-down triangle job (green) waits for resources although the cluster is not fully utilized. (b) ASA$_X$, where rewards follow co-scheduling decision actions, steered through Policy Experts. In this example, the upside-down triangle job (green) is co-allocated with the triangle job (blue).
  • Figure 2: A decision tree (DT) expert structure illustrating the evaluation of the 'CPU' state in an allocation. For each DT 'Metric' (CPU%, Mem%, type, and interval), an action strategy is devised by combining four different distributions $p_{i}$. Then, depending on the state for each 'Metric', one distribution among nine ($p_{i_1}, ..., p_{i_9}$) is returned.
  • Figure 3: Slurm, Static, and ASA$_X$ strategy results - Total (a) Queue makespan and (b) runtime of each cluster size (64, 128, and 256 cores) and scheduling strategy.
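The expert structure described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not say how the four per-metric distributions are combined, so a plain average is assumed here, and the number of actions (`N_ACTIONS`), the random initialization, and all function names are hypothetical.

```python
import random

METRICS = ("cpu", "mem", "type", "interval")  # the four DT metrics named in Figure 2
N_STATES = 9    # nine candidate distributions per metric, per Figure 2
N_ACTIONS = 3   # hypothetical number of co-allocation actions (not given in the caption)


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def make_expert(rng):
    """Build one expert: for each metric, nine distributions over actions.

    Random initialization is an assumption made only so the sketch runs.
    """
    return {
        metric: [
            normalize([rng.random() + 1e-9 for _ in range(N_ACTIONS)])
            for _ in range(N_STATES)
        ]
        for metric in METRICS
    }


def expert_strategy(expert, state):
    """Select one distribution per metric from its state index, then average.

    Averaging is an assumed combination rule; the caption only says the
    four distributions are combined.
    """
    picked = [expert[metric][state[metric]] for metric in METRICS]
    return [sum(p[a] for p in picked) / len(picked) for a in range(N_ACTIONS)]


rng = random.Random(0)
expert = make_expert(rng)
# Hypothetical discretized state: one index in 0..8 per metric.
state = {"cpu": 2, "mem": 5, "type": 0, "interval": 8}
strategy = expert_strategy(expert, state)
print(strategy)
```

Because each selected distribution sums to one, their average is again a valid distribution over actions, which is why averaging is a convenient stand-in for the unspecified combination step.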

Theorems & Definitions (2)

  • Theorem 1
  • proof
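Theorem 1 is stated in terms of a non-increasing sequence $\gamma_t$ and the excess risk after $t$ rounds. To make those two ingredients concrete, the sketch below runs the textbook exponentially weighted forecaster (Hedge) with a decaying step size and measures the gap between the learner's cumulative expected loss and that of the best single expert. This is a swapped-in standard technique used only for illustration; it is not claimed to be the update rule of ASA$_X$ or the bound of Theorem 1. The losses (taken as one minus a reward), the step-size schedule, and all names are assumptions.

```python
import math
import random


def run_hedge(loss_rounds, gammas):
    """Exponentially weighted experts with a per-round step size.

    loss_rounds[t][i] is the loss of expert i in round t, assumed in [0, 1].
    Returns (learner's cumulative expected loss, best expert's cumulative loss).
    """
    n_experts = len(loss_rounds[0])
    cum_loss = [0.0] * n_experts
    learner_loss = 0.0
    for losses, gamma in zip(loss_rounds, gammas):
        # Subtract the minimum before exponentiating for numerical stability;
        # this does not change the normalized probabilities.
        best = min(cum_loss)
        weights = [math.exp(-gamma * (c - best)) for c in cum_loss]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Expected loss of following the experts with these probabilities.
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        cum_loss = [c + l for c, l in zip(cum_loss, losses)]
    return learner_loss, min(cum_loss)


rng = random.Random(0)
T, N_EXPERTS = 2000, 5
# Hypothetical experts: expert i incurs a 0/1 loss with probability 0.2 + 0.1 * i.
means = [0.2 + 0.1 * i for i in range(N_EXPERTS)]
loss_rounds = [[1.0 if rng.random() < m else 0.0 for m in means] for _ in range(T)]
# A non-increasing step-size sequence (an assumed schedule).
gammas = [math.sqrt(8.0 * math.log(N_EXPERTS) / t) for t in range(1, T + 1)]

learner_loss, best_loss = run_hedge(loss_rounds, gammas)
excess = learner_loss - best_loss
print(f"average excess loss per round: {excess / T:.4f}")
```

With a decaying step size of this kind, the average excess loss per round shrinks as the number of rounds grows, which is the qualitative behavior a convergence guarantee on excess risk describes.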