MRSch: Multi-Resource Scheduling for HPC

Boyang Li, Yuping Fan, Matthew Dearing, Zhiling Lan, Paul Rich, William Allcock, Michael Papka

TL;DR

MRSch tackles multi-resource HPC scheduling by applying Direct Future Prediction (DFP), a multi-objective reinforcement learning approach that uses a dynamically weighted goal vector to optimize long-term resource utilization. It represents jobs and resources as vectors, employs a single MLP-based state module, implements a window-based reservation with EASY backfilling, and trains across real, sampled, and synthetic workloads to achieve robust performance. In trace-based simulations on Theta/ALCF data, MRSch outperforms heuristic, optimization-based, and scalar-RL baselines by up to 48% across system- and user-level metrics, and demonstrates adaptability to workload shifts and extensibility to additional resources. The work also discusses deployment challenges, notably interpretability, and outlines future directions for making RL-driven HPC schedulers more transparent while maintaining performance gains.
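The role of the dynamically weighted goal vector can be illustrated with a small sketch. In DFP-style action selection, each candidate action gets a vector of predicted future measurements, and the agent picks the action whose prediction scores highest under the current goal weights. The function name `select_job`, the two measurements (node and burst-buffer utilization gain), and all numbers below are hypothetical and chosen only for illustration; they are not taken from the paper, and the learned predictor is replaced by fixed values.

```python
from typing import Sequence


def select_job(predictions: Sequence[Sequence[float]],
               goal: Sequence[float]) -> int:
    """Return the index of the candidate with the highest goal-weighted score.

    predictions[i] holds the (hypothetical) predicted future measurements
    if candidate job i is selected; goal holds one weight per measurement.
    """
    if not predictions:
        raise ValueError("no candidate jobs to choose from")
    if any(len(pred) != len(goal) for pred in predictions):
        raise ValueError("each prediction needs one value per goal weight")
    # Score each candidate by the dot product of goal weights and predictions.
    scores = [sum(g * p for g, p in zip(goal, pred)) for pred in predictions]
    # Pick the candidate with the largest score (first one wins on ties).
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical predicted gains in (node utilization, burst-buffer utilization)
# for three jobs in the scheduling window.
predictions = [[0.30, 0.05], [0.10, 0.40], [0.20, 0.20]]

# Two illustrative goal vectors: one prioritizes nodes, one the burst buffer.
node_heavy_goal = [0.8, 0.2]
bb_heavy_goal = [0.2, 0.8]

chosen_when_nodes_matter = select_job(predictions, node_heavy_goal)
chosen_when_bb_matters = select_job(predictions, bb_heavy_goal)
```

With the same predictions, changing only the goal weights changes which job is selected. This is the sense in which a goal vector lets one trained model shift its resource priorities without retraining.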

Abstract

Emerging workloads in high-performance computing (HPC) are changing significantly, for example by having diverse resource requirements rather than being CPU-centric. This shift forces cluster schedulers to consider multiple schedulable resources during decision-making. Existing scheduling studies rely on heuristic or optimization methods, which cannot adapt to new scenarios and therefore cannot ensure long-term scheduling performance. We present an intelligent scheduling agent named MRSch for multi-resource scheduling in HPC that leverages direct future prediction (DFP), an advanced multi-objective reinforcement learning algorithm. While DFP demonstrated outstanding performance in a gaming competition, it has not been previously explored in the context of HPC scheduling. Several key techniques are developed in this study to tackle the challenges involved in multi-resource scheduling. These techniques enable MRSch to learn an appropriate scheduling policy automatically and to adapt its policy dynamically in response to workload changes via dynamic resource prioritizing. We compare MRSch with existing scheduling methods through extensive trace-based simulations. Our results demonstrate that MRSch improves scheduling performance by up to 48% compared to the existing scheduling methods.

Paper Structure

This paper contains 22 sections, 1 equation, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: An example illustrating the limitation of fixing the priority of each objective for job scheduling.
  • Figure 2: Overview of MRSch. The environment (the top portion) denotes the HPC multi-resource scheduling system. The MRSch agent (the bottom portion) contains three input modules (state, measurement, and goal) and interacts with the environment by observing environmental changes and making scheduling decisions (i.e., selecting jobs for execution). The arrows between the agent and the environment indicate the information flows between them.
  • Figure 3: Comparison of MRSch scheduling performance with different state modules (MLP vs. CNN). The results indicate that the MLP is more beneficial for multi-resource scheduling.
  • Figure 4: Comparison of the quality and convergence of MRSch when trained with different jobset orderings. The loss is the mean squared error.
  • Figure 5: Scheduling performance in terms of system-level metrics.
  • ...and 5 more figures