
Towards providing reliable job completion time predictions using PCS

Abdullah Bin Faisal, Noah Martin, Hafiz Mohsin Bashir, Swaminathan Lamelas, Fahad R. Dogar

TL;DR

The paper tackles the challenge of providing reliable job completion time (JCT) predictions in cloud environments by introducing PCS, a predictability-centric scheduling framework built on Weighted-Fair-Queueing (WFQ) and a simulation-based search for Pareto-optimal configurations. PCS delivers accurate JCT predictions while letting operators balance performance and fairness through a bi-directional preference interface and a Pareto front of WFQ configurations. For DNN workloads on GPUs, the approach shows substantial gains in predictability with only modest sacrifices in JCT and fairness, validated on a real testbed and in larger-scale simulations. This work offers a practical path toward integrating reliable JCT predictions into cloud scheduling, enabling informed user decisions and inter-cloud orchestration, and it lays groundwork for extending predictability to other resource types and workloads.

Abstract

In this paper, we build a case for providing job completion time predictions to cloud users, similar to the delivery date of a package or the arrival time of a booked ride. Our analysis reveals that providing predictability can come at the expense of performance and fairness. Existing cloud scheduling systems optimize for extreme points in the trade-off space, making them either extremely unpredictable or impractical. To address this challenge, we present PCS, a new scheduling framework that aims to provide predictability while balancing other traditional objectives. The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., class weights) that meets specific goals for predictability. It uses a simulation-aided search strategy to efficiently discover WFQ configurations that lie on the Pareto front of the trade-off space between these objectives. We implement and evaluate PCS in the context of DNN job scheduling on GPUs. Our evaluation, on a small-scale GPU testbed and in larger-scale simulations, shows that PCS can provide accurate completion time estimates while marginally compromising on performance and fairness.
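To make the notion of a Pareto front over WFQ configurations concrete, the following is a minimal sketch, not the search procedure used by PCS. It assumes each candidate configuration has already been simulated and scored on two objectives where lower is better. The names `Candidate`, `pred_err`, `avg_jct`, `dominates`, and `pareto_front` are hypothetical, and the numbers are invented for illustration only.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """One simulated WFQ configuration (all field names are hypothetical)."""
    weights: Tuple[float, ...]  # per-class WFQ weights
    pred_err: float             # prediction error from simulation, lower is better
    avg_jct: float              # normalized average JCT, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.pred_err <= b.pred_err and a.avg_jct <= b.avg_jct
    strictly_better = a.pred_err < b.pred_err or a.avg_jct < b.avg_jct
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates (brute-force O(n^2))."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores purely for illustration; these are not results from the paper.
candidates = [
    Candidate((1.0,), 0.80, 1.00),
    Candidate((0.7, 0.3), 0.40, 1.10),
    Candidate((0.5, 0.5), 0.45, 1.30),  # worse than (0.7, 0.3) on both objectives
    Candidate((0.9, 0.1), 0.10, 1.60),
]
front = pareto_front(candidates)
for c in front:
    print(c.weights, c.pred_err, c.avg_jct)
```

In this toy input, the `(0.5, 0.5)` configuration is dropped because another configuration is better on both objectives. The remaining three each represent a different trade-off that an operator could choose from. A brute-force filter like this only ranks configurations that have already been evaluated; the abstract describes a simulation-aided search for discovering such configurations efficiently.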

Paper Structure

This paper contains 55 sections, 1 equation, 11 figures, 1 table.

Figures (11)

  • Figure 1: Toy example with 1 GPU, demonstrating the limitation of existing strategies. (a) shows how the scheduling order changes as jobs arrive over time under the Tiresias [tiresias], Themis [themis], and FIFO [YARN] schedulers. Time moves from left to right, with a new job arriving in each column. The expected finish times for the current jobs are displayed above the current schedule. Finished jobs are grayed out. (b) summarizes the results for performance, fairness, and predictability under these policies.
  • Figure 2: Key components of PCS: The preference framework can be used by operators to specify high-level objectives. The preference solver uses a simulation-based search strategy to find Pareto-optimal WFQ configurations that are then shared with the operator. On the critical path, users submit their jobs along with the job's demand function and are given a JCT$_{pred}$.
  • Figure 3: Pareto front of the trade-off between Pred$_{err}$ and normalized average JCT for workload-2 (§\ref{sec:eval}). "Better" indicates WFQ configurations that achieve a tight bound on average/tail Pred$_{err}$ while incurring the smallest possible increase in average JCT.
  • Figure 4: [Testbed] Distribution of Pred$_{err}$ for three configurations discovered by PCS (performance-oriented, predictability-oriented, and balanced), compared with other schemes.
  • Figure 5: [Testbed] Zooming into the trade-off between performance and predictability. PCS is within 1.1$\times$ of AFS at p90 JCT, with a significant improvement in predictability.
  • ...and 6 more figures