Towards providing reliable job completion time predictions using PCS
Abdullah Bin Faisal, Noah Martin, Hafiz Mohsin Bashir, Swaminathan Lamelas, Fahad R. Dogar
TL;DR
The paper tackles the challenge of providing reliable job completion time (JCT) predictions in cloud environments by introducing PCS, a predictability-centric scheduling framework built on Weighted-Fair-Queueing (WFQ) and a simulation-based search for Pareto-optimal configurations. PCS delivers accurate JCT predictions while allowing operators to balance performance and fairness through a bi-directional preference interface and a Pareto front of WFQ configurations. The approach demonstrates substantial gains in predictability with only modest sacrifices in JCT and fairness for DNN workloads on GPUs, validated on a small-scale GPU testbed and in larger-scale simulations. This work offers a practical path toward integrating reliable JCT predictions into cloud scheduling, enabling informed user decisions and inter-cloud orchestration, and it lays groundwork for extending predictability to other resource types and workloads.
Abstract
In this paper, we build a case for providing job completion time predictions to cloud users, similar to the delivery date of a package or the arrival time of a booked ride. Our analysis reveals that providing predictability can come at the expense of performance and fairness. Existing cloud scheduling systems optimize for extreme points in the trade-off space, making them either extremely unpredictable or impractical. To address this challenge, we present PCS, a new scheduling framework that aims to provide predictability while balancing other traditional objectives. The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., class weights) that meets specific goals for predictability. It uses a simulation-aided search strategy to efficiently discover WFQ configurations that lie on the Pareto front of the trade-off space between these objectives. We implement and evaluate PCS in the context of DNN job scheduling on GPUs. Our evaluation, on a small-scale GPU testbed and in larger-scale simulations, shows that PCS can provide accurate completion time estimates while only marginally compromising on performance and fairness.
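To make the notion of a Pareto front over WFQ configurations concrete, the following is a minimal sketch, not the PCS implementation. It assumes each candidate configuration has already been scored (for example, by a simulator) on three objectives that are all minimized. The names `Candidate`, `pred_error`, `avg_jct`, and `unfairness`, and all numeric values, are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """A hypothetical WFQ configuration with already-measured objectives."""
    weights: Tuple[float, ...]  # assumed per-class WFQ weights
    pred_error: float           # JCT prediction error (lower is better)
    avg_jct: float              # average job completion time (lower is better)
    unfairness: float           # unfairness score (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on every objective
    and strictly better on at least one."""
    a_obj, b_obj = a[1:], b[1:]  # skip the weights field
    no_worse = all(x <= y for x, y in zip(a_obj, b_obj))
    strictly_better = any(x < y for x, y in zip(a_obj, b_obj))
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Illustrative values only; these are not results from the paper.
candidates = [
    Candidate((1.0,), 0.50, 10.0, 0.10),
    Candidate((0.7, 0.3), 0.20, 11.0, 0.15),
    Candidate((0.5, 0.5), 0.25, 12.0, 0.20),  # dominated by the previous entry
    Candidate((0.9, 0.1), 0.05, 14.0, 0.30),
]
front = pareto_front(candidates)
for c in front:
    print(c.weights, c.pred_error, c.avg_jct, c.unfairness)
```

The quadratic filter is adequate for a small candidate set. How PCS generates candidates and prunes the search space is described by the paper itself and is not modeled here.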
