Stannic: Systolic STochAstic ONliNe SchedulIng AcCelerator

Adam H. Ross, Vairavan Palaniappan, Debjit Pal

Abstract

Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to balance workloads efficiently because of scheduling overhead, lack of adaptability to stochastic workloads, and suboptimal resource utilization. The scheduling problem is further compounded in shared HPC clusters, where job arrivals and processing times are inherently stochastic. Prediction of these elements is possible, but it introduces additional overhead. To perform this complex scheduling, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic inherits a schedule-centric abstraction. These hardware-assisted solutions leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. We accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules achieve efficient machine utilization and low average job latency in stochastic contexts.

Paper Structure

This paper contains 45 sections, 5 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Illustration of the decision space for Stochastic, Online Heterogeneous Scheduling. Load Balancing and Resource Contention are standard considerations in multi-machine scheduling, with the other axis presenting new considerations in the heterogeneous system context.
  • Figure 2: Stochastic Online Scheduling (SOS) Algorithmic Flows. \ref{fig:alg_flow_task_perspective} shows the algorithmic flow for stochastic online scheduling from a task-centric perspective. Phase I prepares a job for the scheduler; Phase II and Phase III show the steps involved in scheduling the job. \ref{fig:STANNIC_alg_flow} revises the algorithmic flow, involving the same functional steps but from a persistent, virtual-schedule perspective. Due to this re-framing, the algorithmic flow is now cyclical instead of linear as in \ref{fig:alg_flow_task_perspective}. For clarity, we used the same phase labeling to demonstrate the functional similarity of the two algorithmic flows.
  • Figure 3: Systolic Flow Example. A $5 \times 5$ systolic array for matrix multiplication \cite{Systolic_Ex}. Each processing element (PE) is responsible for accumulating the value of its corresponding index in the output matrix $e$. To do this, each PE cascades row data from input matrix $c$ to its right neighbor and column data from input matrix $d$ to its downward neighbor.
  • Figure 4: Top-level block diagram of the Hercules scheduler. Phase II and III are the phases shown in \ref{fig:alg_flow_task_perspective}.
  • Figure 5: Job Metadata Memory register implementation. $x$: configurable based on the maximum number of jobs across all machines, computed as $\lceil \log_2 (M \times N) \rceil$. $M$: number of machines. $N$: maximum number of jobs in $V_i$ of machine $M_i$. This leads to a total register width of $x+24$ bits.
  • ...and 10 more figures
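
The captions for Figure 3 and Figure 5 describe two mechanisms that a short sketch may make more concrete. The Python below is an illustrative software model only, not the authors' hardware design. The function names `systolic_matmul` and `metadata_register_width`, the skewed input timing, the restriction to square matrices, and the example values of $M$ and $N$ are all assumptions introduced here. The first function mimics an output-stationary systolic array in which each PE accumulates one entry of $e$ while passing row data from $c$ to the right and column data from $d$ downward. The second evaluates the register width $x + 24$ with $x = \lceil \log_2 (M \times N) \rceil$.

```python
def systolic_matmul(c, d):
    """Cycle-by-cycle model of an n x n output-stationary systolic array.

    Assumption: square inputs, with row i of c entering from the left
    delayed by i cycles and column j of d entering from the top delayed
    by j cycles, so matching operands meet in PE (i, j).
    """
    n = len(c)
    e = [[0] * n for _ in range(n)]      # accumulator held by each PE
    a_reg = [[0] * n for _ in range(n)]  # row operand currently in each PE
    b_reg = [[0] * n for _ in range(n)]  # column operand currently in each PE

    # The last product (i = j = k = n - 1) is formed at cycle 3n - 3.
    for t in range(3 * n - 2):
        # Each PE hands its row operand to its right neighbor.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        # Each PE hands its column operand to its downward neighbor.
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the skewed input streams at the left and top edges.
        for i in range(n):
            k = t - i
            a_reg[i][0] = c[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = d[k][j] if 0 <= k < n else 0
        # Every PE multiplies its two operands and accumulates locally.
        for i in range(n):
            for j in range(n):
                e[i][j] += a_reg[i][j] * b_reg[i][j]
    return e


def metadata_register_width(num_machines, max_jobs_per_machine):
    """Register width x + 24, where x = ceil(log2(M * N)) as in Figure 5."""
    total_jobs = num_machines * max_jobs_per_machine
    # For a positive integer v, (v - 1).bit_length() equals ceil(log2(v)).
    x = (total_jobs - 1).bit_length()
    return x + 24


if __name__ == "__main__":
    c = [[1, 2], [3, 4]]
    d = [[5, 6], [7, 8]]
    print(systolic_matmul(c, d))
    # Hypothetical configuration: 4 machines, up to 16 jobs each.
    print(metadata_register_width(4, 16))
```

Under the hypothetical configuration of 4 machines with up to 16 jobs each, $M \times N = 64$, so $x = 6$ and the register would be 30 bits wide. The integer `bit_length` form is used instead of a floating-point logarithm to avoid rounding concerns for large job counts.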