HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling
Matthias Maiterth, Wesley H. Brewer, Jaya S. Kuruvella, Arunavo Dey, Tanzima Z. Islam, Kevin Menear, Dmitry Duplyakin, Rashadul Kabir, Tapasya Patki, Terry Jones, Feiyi Wang
TL;DR
The paper addresses the challenge of evaluating HPC scheduling policies in the presence of infrastructure effects by introducing HPC Digital Twins extended with scheduling capabilities. It presents S-RAPS, a scheduling-enabled extension of the ExaDigiT framework that integrates built-in and external schedulers, supports open HPC datasets, and enables what-if analyses including incentive structures and ML-guided policies. The key contributions are an architectural refactor that enables scheduling within a digital twin, dataloader extensions for open datasets, interfaces to external schedulers such as ScheduleFlow and FastSim, and use cases for incentive structures and ML-driven scheduling. This work enables holistic, data-driven analysis of how scheduling decisions impact power, cooling, and system efficiency, with practical implications for design, procurement, and operational planning in HPC centers.
Abstract
Schedulers are critical for optimal resource utilization in high-performance computing. Traditional methods for evaluating schedulers are limited to post-deployment analysis or to simulators that do not model the associated infrastructure. In this work, we present the first-of-its-kind integration of scheduling and digital twins in HPC. This enables what-if studies to understand the impact of parameter configurations and scheduling decisions on the physical assets, even before deployment or for changes not easily realizable in production. We (1) provide the first digital twin framework extended with scheduling capabilities, (2) integrate various top-tier HPC systems based on their publicly available datasets, and (3) implement extensions to integrate external scheduling simulators. Finally, we show how to (4) implement and evaluate incentive structures, as well as (5) evaluate machine-learning-based scheduling, in this novel digital-twin-based meta-framework for prototyping scheduling. Our work enables what-if scenarios of HPC systems to evaluate sustainability and the impact on the simulated system.
