HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling

Matthias Maiterth, Wesley H. Brewer, Jaya S. Kuruvella, Arunavo Dey, Tanzima Z. Islam, Kevin Menear, Dmitry Duplyakin, Rashadul Kabir, Tapasya Patki, Terry Jones, Feiyi Wang

TL;DR

The paper addresses the challenge of evaluating HPC scheduling policies in the presence of infrastructure effects by introducing HPC Digital Twins extended with scheduling capabilities. It presents S-RAPS, a scheduling-enabled extension of the ExaDigiT framework that integrates built-in and external schedulers, supports open HPC datasets, and enables what-if analyses including incentive structures and ML-guided policies. The key contributions include the architectural refactor enabling scheduling within a digital twin, dataloader extensions for open datasets, interfaces to external schedulers such as ScheduleFlow and FastSim, and use-cases for incentive structures and ML-driven scheduling. This work enables holistic, data-driven analysis of how scheduling decisions impact power, cooling, and system efficiency, with practical implications for design, procurement, and operational planning in HPC centers.

Abstract

Schedulers are critical for optimal resource utilization in high-performance computing. Traditional methods to evaluate schedulers are limited to post-deployment analysis or to simulators that do not model the associated infrastructure. In this work, we present the first-of-its-kind integration of scheduling and digital twins in HPC. This enables what-if studies to understand the impact of parameter configurations and scheduling decisions on the physical assets, even before deployment, or regarding changes not easily realizable in production. We (1) provide the first digital twin framework extended with scheduling capabilities, (2) integrate various top-tier HPC systems given their publicly available datasets, and (3) implement extensions to integrate external scheduling simulators. Finally, we show how to (4) implement and evaluate incentive structures, as well as (5) evaluate machine-learning-based scheduling, in this novel digital-twin-based meta-framework for prototyping scheduling. Our work enables what-if scenarios of HPC systems to evaluate sustainability and the impact on the simulated system.

Paper Structure

This paper contains 44 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Simplified original overview, following Brewer et al. [brewer2024digital], with module on the right.
  • Figure 2: S-RAPS: Integration of scheduling into the design of ExaDigiT's RAPS, with improved configuration mechanisms, pluggable dataloaders, an interface to built-in and external schedulers, and an overhauled simulation loop.
  • Figure 3: Example job trace, with job-submit time, -start time and -end time. The time-stepped simulator triggers on each time step, while the event based scheduling simulator only has to react to triggered events (magenta arrows) such as start of a job (job 4), end of a job (job 2), and submission of a new job (job 5).
  • Figure 4: Replay and reschedule of the data from the PM100 dataset (offset 50 days + 17 h). System power and utilization are shown for FCFS with no backfill (fcfs-nobf), FCFS with backfill (fcfs-easy), priority scheduling with first-fit backfill (priority-ffbf), and a replay of the jobs as they were executed.
  • Figure 5: Replay and Reschedule of 15 days of Adastra (full dataset Adastra15D). Replay is shown in blue, while all rescheduled runs ( & priority) overlap almost exactly (brown line). Given known job-power profiles and schedule information, the simulator can predict and match the observed power profile, seen as matching timed up/down-swings.
  • ...and 5 more figures
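
The distinction drawn in the Figure 3 caption, between a time-stepped simulator that triggers on every step and an event-based scheduling simulator that reacts only to job submit, start, and end events, can be illustrated with a minimal Python sketch. The `Job` class, the three-job trace, and all function names below are hypothetical and chosen for illustration only; they are not taken from S-RAPS, ExaDigiT, ScheduleFlow, or FastSim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record with the three timestamps named in Figure 3."""
    job_id: int
    submit: int  # time the job enters the queue
    start: int   # time the job begins running
    end: int     # time the job finishes


def time_stepped_wakeups(t_end, dt=1):
    """Time-stepped loop: the simulator is triggered on every tick."""
    return list(range(0, t_end + 1, dt))


def event_wakeups(jobs):
    """Event-based loop: react only to submit, start, and end events."""
    times = set()
    for job in jobs:
        times.update((job.submit, job.start, job.end))
    return sorted(times)


def running_jobs(jobs, t):
    """Number of jobs running at time t (start inclusive, end exclusive)."""
    return sum(1 for job in jobs if job.start <= t < job.end)


# Illustrative trace (assumed values, not from any dataset in the paper).
jobs = [
    Job(job_id=1, submit=0, start=0, end=40),
    Job(job_id=2, submit=5, start=10, end=30),
    Job(job_id=3, submit=20, start=40, end=90),
]

ticks = time_stepped_wakeups(t_end=100)
events = event_wakeups(jobs)
print(f"time-stepped wake-ups: {len(ticks)}, event-based wake-ups: {len(events)}")
for t in events:
    print(f"t={t:3d}  running={running_jobs(jobs, t)}")
```

In this toy trace the event-based loop wakes far less often than the time-stepped loop. A time-stepped loop is still the natural fit for quantities that evolve continuously between events, such as power and cooling, which is presumably why a digital twin with scheduling has to reconcile both views.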