A Performance Analysis of Task Scheduling for UQ Workflows on HPC Systems

Chung Ming Loi, Anne Reinarz, Mikkel Lykkegaard, William Hornsby, James Buchanan, Linus Seelinger

TL;DR

Uncertainty quantification (UQ) workloads on HPC systems generate very large numbers of similar tasks with unpredictable runtimes, which stress traditional schedulers. The authors introduce a non-intrusive load-balancing framework that sits atop the existing scheduler, using UM-Bridge and HyperQueue to orchestrate forward-model evaluations, and demonstrate it with GS2 plasma turbulence simulations and a Gaussian Process surrogate. Compared with a naive SLURM approach, scheduling overhead drops by up to three orders of magnitude and CPU time for long-running tasks drops by up to 38%, though startup overhead can penalize fast tasks. The approach applies broadly to loosely coupled UQ workflows and improves resource utilization and scalability on existing HPC infrastructure. Future work targets persistent servers and more complex workflow dependencies.

Abstract

Uncertainty Quantification (UQ) workloads are becoming increasingly common in science and engineering. They involve the submission of thousands or even millions of similar tasks with potentially unpredictable runtimes, where the total number of tasks is usually not known a priori. A static one-size-fits-all batch script would likely lead to suboptimal scheduling, and native schedulers installed on High Performance Computing (HPC) systems, such as SLURM, often struggle to handle such workloads efficiently. In this paper, we introduce a new load balancing approach suitable for UQ workflows. To demonstrate its efficiency in a real-world setting, we focus on the GS2 gyrokinetic plasma turbulence simulator. Individual simulations can be computationally demanding, with runtimes varying significantly, from minutes to hours, depending on the high-dimensional input parameters. Our approach uses the UQ and Modelling Bridge (UM-Bridge), which offers a language-agnostic interface to a simulation model, combined with HyperQueue, which works alongside the native scheduler. In particular, deploying this framework on HPC systems does not require system-level changes. We benchmark our proposed framework against a standalone SLURM approach using GS2 and a Gaussian Process surrogate thereof. Our results demonstrate a reduction in scheduling overhead by up to three orders of magnitude and a maximum reduction of 38% in CPU time for long-running simulations compared to the naive SLURM approach, while making no assumptions about the job submission patterns inherent to UQ workflows.
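To make the role of per-task scheduling overhead concrete, the following is a minimal toy model, not the authors' framework and not HyperQueue's actual algorithm. It assumes independent tasks dispatched greedily to whichever worker frees up first, with a fixed latency paid before each task starts. The function name `makespan`, the worker count, the runtime range, and both overhead values are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
import heapq
import random


def makespan(runtimes, n_workers, per_task_overhead):
    """Toy model: greedy dispatch of independent tasks to the earliest-free worker.

    per_task_overhead is a fixed delay (seconds) paid before every task,
    standing in for scheduler latency. Returns the time the last task ends.
    """
    # Min-heap of the times at which each worker next becomes free.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for runtime in runtimes:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + per_task_overhead + runtime)
    return max(free_at)


# Hypothetical workload: 200 tasks with runtimes between 1 minute and 1 hour.
random.seed(0)
runtimes = [random.uniform(60.0, 3600.0) for _ in range(200)]

# Hypothetical overheads that differ by three orders of magnitude.
high_overhead = makespan(runtimes, n_workers=8, per_task_overhead=30.0)
low_overhead = makespan(runtimes, n_workers=8, per_task_overhead=0.03)

print(f"makespan with 30 s overhead per task:   {high_overhead:.1f} s")
print(f"makespan with 0.03 s overhead per task: {low_overhead:.1f} s")
```

In this model the overhead is paid once per task, so its share of the makespan grows as tasks get shorter. This is one way to read the abstract's focus on scheduler overhead for workloads made of thousands or millions of tasks.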

Paper Structure

This paper contains 16 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Top: Pre-defined Kubernetes configuration for parallel instances of any UM-Bridge model container. Bottom: Load balancer configuration for parallel instances of any UM-Bridge model container.
  • Figure 2: Top: Three functions drawn from a GP posterior distribution, where $\times$ marks the 4 training data points. Bottom: Mean and uncertainty obtained from the trained GP. Again, $\times$ marks the 4 training data points, and the shaded blue region corresponds to the 95% confidence interval.
  • Figure 3: Boxplots showing experimental results with 2 jobs (left column) and 10 jobs (right column) filling the queue. For each application (listed on the x-axis), the left (blue) boxes represent data collected from SLURM and the right (red) boxes represent data from HQ. The top row shows the makespan, the middle row the CPU time, and the bottom row the scheduler overhead, all measured in seconds.
  • Figure 4: Boxplots showing the SLR for two jobs filling the queue (top) and 10 jobs filling the queue (bottom). For each application (listed on the x-axis), the left (blue) boxes represent data collected from SLURM and the right (red) boxes represent data from HQ.
  • Figure 5: Boxplots showing the SLR for two jobs filling the queue (top) and 10 jobs filling the queue (bottom). The left (blue) box on the x-axis represents data collected from SLURM, and the right (purple) box represents data from the UM-Bridge SLURM backend.
  • ...and 1 more figures