Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems

Glen MacLachlan, Joseph Creech, Rubeel Muhammad Iqbal, Clark Gaylord, Jake Messick

Abstract

Migrating heterogeneous high-performance computing (HPC) systems to resource-aware scheduling introduces both technical and behavioral challenges, particularly in production environments with established user workflows. This paper presents a case study of transitioning a production academic HPC cluster from node-exclusive to consumable resource scheduling mid-lifecycle, without disrupting active workloads. We describe an operational strategy combining a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement to guide adoption of explicit resource declaration. This approach protected active research workflows throughout the transition, avoiding the disruption that a direct cut-over would have imposed on the user community. Following deployment, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to 3.4 minutes for GPU workloads. Users who adopted TRES-based submission exhibited strong long-term retention. These results demonstrate that successful scheduling transitions depend not only on system configuration, but on aligning observability, user engagement, and operational design.

Paper Structure

This paper contains 7 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: CPU workloads running on a GPU node, under-utilizing the available GPU devices.
  • Figure 2: Daily fraction of jobs using explicit resource declarations for CPU and GPU resources during the transition period. The data show progressive but non-monotonic adoption of TRES-based submission patterns, reflecting variability in workload composition and user behavior over time.
  • Figure 3: Kaplan–Meier estimate of continued TRES usage as a function of Jobs Until Reversion (JUR) to legacy submission (log-scaled horizontal axis). The curve shows a sharp early decline followed by a plateau, indicating stable retention among users who persist beyond the initial jobs.