Improving Multiresource Job Scheduling with Markovian Service Rate Policies

Zhongrui Chen, Isaac Grosof, Benjamin Berg

TL;DR

The paper tackles the challenge of minimizing mean response time for multiresource jobs on a single server by introducing Markovian Service Rate (MSR) policies. MSR uses a finite-state CTMC to select from a predefined set of schedulable action vectors, computed offline, ensuring stability (throughput-optimality) under various preemption models, while enabling additively tight response-time bounds and a low online decision cost. A decoupled queueing analysis based on MSR-1 systems yields bounds and a practical queue-length approximation, guiding policy design via an MIQCP offline optimization and a loop-structured online policy with $O(1)$ per-decision complexity. Empirical evaluation on a Google Borg trace and complementary simulations show that MSR policies, especially with BackFilling, can substantially reduce mean response time at moderate to high loads compared to MaxWeight, FCFS, and Randomized-Timers, while maintaining tractability and robustness to preemption type. The work offers a practical, theoretically grounded framework for designing low-complexity, throughput-optimal schedulers for complex cloud workloads with multiple resource types.
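To make the CTMC-driven selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes the modulating chain is a simple cycle over candidate schedules (one reading of "loop-structured"), and it assumes the usual condition for Markov-modulated queues: each job type's arrival rate must be below its long-run average service rate. The function names and the `schedules[i][k]` layout are hypothetical.

```python
from fractions import Fraction

def cycle_stationary(switch_rates):
    """Stationary distribution of a CTMC that visits states 0, 1, ..., n-1, 0, ... in order.

    switch_rates[i] is the rate of leaving state i. For a cyclic chain, the
    fraction of time spent in state i is proportional to its mean holding
    time, 1 / switch_rates[i].
    """
    holding = [Fraction(1) / Fraction(r) for r in switch_rates]
    total = sum(holding)
    return [h / total for h in holding]

def average_service_rates(schedules, service_rates, switch_rates):
    """Long-run service rate offered to each job type.

    schedules[i][k]  = number of type-k jobs served while the chain is in state i
                       (hypothetical layout of the action vectors)
    service_rates[k] = completion rate of a single type-k job
    """
    pi = cycle_stationary(switch_rates)
    return [
        sum(pi[i] * schedules[i][k] * service_rates[k] for i in range(len(schedules)))
        for k in range(len(service_rates))
    ]

def is_stable(arrival_rates, schedules, service_rates, switch_rates):
    """Assumed stability test: every arrival rate is below the average service rate."""
    avg = average_service_rates(schedules, service_rates, switch_rates)
    return all(lam < mu for lam, mu in zip(arrival_rates, avg))

# Two job types, two schedules: serve two type-0 jobs, or one type-1 job.
example_schedules = [[2, 0], [0, 1]]
print(is_stable([0.9, 0.4], example_schedules, [1, 1], [2, 2]))
```

In this example the chain spends half its time in each schedule, so the average service rates are 1 for type 0 and 1/2 for type 1. Note that under this assumed model the switching rates affect stability only through their ratios.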

Abstract

Modern cloud computing workloads are composed of multiresource jobs that require a variety of computational resources in order to run, such as CPU cores, memory, disk space, or hardware accelerators. A single cloud server can typically run many multiresource jobs in parallel, but only if the server has sufficient resources to satisfy the demands of every job. A scheduling policy must therefore select sets of multiresource jobs to run in parallel in order to minimize the mean response time across jobs -- the average time from when a job arrives to the system until it is completed. Unfortunately, achieving low response times by selecting sets of jobs that fully utilize the available server resources has proven to be a difficult problem. In this paper, we develop and analyze a new class of policies for scheduling multiresource jobs, called Markovian Service Rate (MSR) policies. While prior scheduling policies for multiresource jobs are either highly complex to analyze or hard to implement, our MSR policies are simple to implement and are amenable to response time analysis. We show that the class of MSR policies is throughput-optimal in that we can use an MSR policy to stabilize the system whenever it is possible to do so. We also derive bounds on the mean response time under an MSR algorithm that are tight up to an additive constant. These bounds can be applied to systems with different preemption behaviors, such as fully preemptive systems, non-preemptive systems, and systems that allow preemption with setup times. We show how our theoretical results can be used to select a good MSR policy as a function of the system arrival rates, job service requirements, the server's resource capacities, and the resource demands of the jobs.
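The constraint that jobs can run in parallel only if the server can satisfy every job's demands can be illustrated with a short sketch. It assumes each job type has a fixed per-job demand for each resource and that a candidate schedule is a vector of job counts per type; the names `is_schedulable`, `action`, `demands`, and `capacity` are hypothetical and do not come from the paper.

```python
def is_schedulable(action, demands, capacity):
    """Return True if the jobs in `action` fit on the server simultaneously.

    action[k]     = number of type-k jobs to run in parallel
    demands[k][r] = amount of resource r needed by one type-k job
    capacity[r]   = total amount of resource r on the server
    """
    for r, cap in enumerate(capacity):
        used = sum(count * demands[k][r] for k, count in enumerate(action))
        if used > cap:
            return False
    return True

# Illustrative numbers only: a server with 8 cores and 16 GB of memory.
capacity = (8, 16)
demands = [(2, 4),   # type 0: 2 cores, 4 GB per job
           (4, 2)]   # type 1: 4 cores, 2 GB per job

print(is_schedulable((2, 1), demands, capacity))  # uses 8 cores, 10 GB
print(is_schedulable((2, 2), demands, capacity))  # would need 12 cores
```

The first vector fits exactly on the core budget, while the second exceeds it, so only the first would be a candidate schedule under these assumptions.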

Paper Structure

This paper contains 41 sections, 17 theorems, 37 equations, 6 figures, 2 tables.

Key Result

Theorem 1

When scheduling preemptible jobs with no preemption overhead, the class of MSR policies is throughput-optimal. That is, if there exists a scheduling policy that can stabilize a multiresource job system with $K$ job types and arrival rates $\bm{\lambda}$, then there exists an MSR policy, $p$, with $N

Figures (6)

  • Figure 1: Two views of the multiresource job system under an MSR policy. While arriving jobs are stored in a central queue, MSR policies allow us to analyze jobs of each type separately.
  • Figure 2: The effect of switching rate on MSR policies under various preemption behaviors in the example from Section \ref{sec:preemptions}. The shaded regions depict our queue length bounds. Our pMSR policies benefit from a high switching rate when system load $\rho=0.9$. However, nMSR policies suffer when $\alpha$ is too high or too low. Our queue length prediction is accurate and can be used to select $\alpha^*$ in both cases.
  • Figure 3: Mean queue length of the sMSR policy $r(\alpha^*)$ as a function of the setup rate, $\gamma$, when $\rho=0.9$. The shaded regions depict our queue length bounds for $r(\alpha^*)$. We compare the performance of $r(\alpha^*)$ to the nMSR policy, $q(\alpha^*)$, and the pMSR policy, $p$.
  • Figure 4: Mean queue length under pMSR and nMSR policies in the example from Section \ref{sec:preemptions}. The pMSR policy uses $\alpha=2$. We use $\alpha^*$ for the nMSR policy at each load. The shaded regions depict the queue length bounds for our MSR policies. The dashed lines represent our queue length approximations.
  • Figure 5: Evaluation of MSR policies using a Google Borg trace. The nMSR policy without BackFilling is not pictured, but performed roughly 10x worse than the policies shown due to excessive unused service. The pMSR and nMSR policies with BackFilling provide low mean response times at all loads.
  • ...and 1 more figure

Theorems & Definitions (21)

  • Definition 1: MSR Policies
  • Theorem 1
  • Corollary 0
  • Theorem 1
  • Lemma 0: Stability Condition
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Definition 2
  • ...and 11 more