Load Balancing with Job-Size Testing: Performance Improvement or Degradation?

Jonatha Anselmi; Josu Doncel

TL;DR

The paper develops a Markovian framework for scheduling with testing in a multi-server setting where jobs have a two-point size distribution and a controllable testing time $\sigma$. The analysis derives a mean waiting time metric $D_c^{(N)}(\sigma)$ that combines testing delay and queueing delay via the Pollaczek–Khinchine formula, and studies two limiting regimes: (i) large systems with $N\to\infty$ and (ii) job sizes whose coefficient of variation grows to infinity. In the large-system limit, testing degrades performance unless short jobs can be predicted almost instantly and the load is sufficiently high. In the high-variability regime, the gain depends on how the testing policy is designed: it can vanish or be made arbitrarily large. Numerical results with job-size distributions from neuroscience applications corroborate the theory and show robustness to deviations from the two-point model. The work provides practical guidance on when to deploy testing policies in HPC-like environments and highlights the role of traffic conditions and job-size variability in determining their effectiveness.
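To illustrate the queueing-delay ingredient mentioned above, the following minimal sketch evaluates the standard Pollaczek–Khinchine mean waiting time of a single M/G/1 FCFS queue with a two-point job-size distribution. It is not the paper's metric $D_c^{(N)}(\sigma)$, which also accounts for testing delay and dispatching. The function name `pk_mean_wait` and the short-job probability 0.9 are assumptions made for illustration; the sizes 25 and 540 are borrowed from the caption of Figure 3.

```python
def pk_mean_wait(lam, sizes, probs):
    """Mean waiting time (queueing delay only) in an M/G/1 FCFS queue.

    Pollaczek-Khinchine: E[W] = lam * E[S^2] / (2 * (1 - rho)), rho = lam * E[S].
    `sizes` and `probs` describe a discrete job-size distribution.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    m1 = sum(p * x for x, p in zip(sizes, probs))      # E[S]
    m2 = sum(p * x * x for x, p in zip(sizes, probs))  # E[S^2]
    rho = lam * m1                                     # server utilization
    if rho >= 1.0:
        raise ValueError("unstable queue: rho >= 1")
    return lam * m2 / (2.0 * (1.0 - rho))


# Hypothetical two-point distribution: short jobs of size 25, long jobs of size 540.
sizes, probs = (25.0, 540.0), (0.9, 0.1)   # 0.9 / 0.1 split is an assumption
mean_size = sum(p * x for x, p in zip(sizes, probs))
lam = 0.5 / mean_size                      # arrival rate giving 50% utilization

w_two_point = pk_mean_wait(lam, sizes, probs)
w_deterministic = pk_mean_wait(lam, (mean_size,), (1.0,))
print(w_two_point, w_deterministic)
```

At equal load, the two-point distribution yields a much larger waiting time than a deterministic size with the same mean, which is why separating short from long jobs can pay off when variability is high.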

Abstract

In the context of decision making under explorable uncertainty, scheduling with testing is a powerful technique used in the management of computer systems to improve performance via better job-dispatching decisions. Upon job arrival, a scheduler may run some testing algorithm against the job to extract some information about its structure, e.g., its size, and properly classify it. The acquisition of such knowledge comes with a cost because the testing algorithm delays the dispatching decisions, though this is under control. In this paper, we analyze the impact of such extra cost in a load balancing setting by investigating the following questions: does it really pay off to test jobs? If so, under which conditions? Under mild assumptions relating the information extracted by the testing algorithm to its running time, we show that whether scheduling with testing brings a performance degradation or improvement strongly depends on the traffic conditions, system size and the coefficient of variation of job sizes. Thus, the general answer to the above questions is non-trivial and some care should be taken when deploying a testing policy. Our results are achieved by proposing a load balancing model for scheduling with testing that we analyze in two limiting regimes. When the number of servers grows to infinity in proportion to the network demand, we show that job-size testing actually degrades performance unless short jobs can be predicted reliably almost instantaneously and the network load is sufficiently high. When the coefficient of variation of job sizes grows to infinity, we construct testing policies inducing an arbitrarily large performance gain with respect to running jobs untested.

Paper Structure (31 sections, 7 theorems, 69 equations, 5 figures)

Introduction
Motivation
Related Work
Contribution
Organization
Load Balancing with Job-Size Testing
Architecture
Jobs
Scheduler
Dispatching Policy
Testing Algorithm: Quality of Prediction vs Running Time
Performance Measure
Additional Notation
Large Systems
Limiting Regime
...and 16 more sections

Key Result

Proposition 1

Assume that eq:tau_def holds. For any $\tau\ge 0$,

Figures (5)

Figure 1: Architecture of the proposed model for load balancing with job-size testing. It is assumed that $c\in\mathbb{N}$.
Figure 2: Plots of the efficiency measure $\mathcal{E}$, see \ref{eq:eff_def}, assuming that \ref{def:HV} holds with $\alpha=1$ and $f(\cdot)=\cdot$. Also, $P_{x_m,x_m}(\sigma)=(1-e^{-10\sigma}) (\mathbb{P}(X=x_m)-P_{x_m,x_m}(0)) + P_{x_m,x_m}(0)$, $P_{x_M,x_M}(\sigma)=(1-e^{-\sigma}) (\mathbb{P}(X=x_M)-P_{x_M,x_M}(0)) + P_{x_M,x_M}(0)$, $P_{x_m,x_M}(\sigma)=\mathbb{P}(X=x_m)-P_{x_m,x_m}(\sigma)$ and $P_{x_M,x_m}(\sigma)=\mathbb{P}(X=x_M)-P_{x_M,x_M}(\sigma)$, which means that the profile matrix $P_a(\sigma)$ agrees with the law of diminishing returns.
Figure 3: Plots of the efficiency measure \ref{eq:eff_def} with respect to job-size distributions from neuroscience applications \cite{aupy}, where $x_m=25$ and $x_M=540$. The vertical lines denote the heuristic testing time choice given in \ref{eq:design_sigma}. Blue resp. red lines refer to a system with $N=10$ resp. $N=100$ servers.
Figure 4: Bimodal distribution of job sizes from neuroscience applications. The vertical dashed line distinguishes between short and long jobs.
Figure 5: Plots of the efficiency measure $\mathcal{E}$. The continuous resp. dashed lines refer to a system where job sizes are captured by $X$, with density given in \ref{eq:f_Xx}, resp. $\tilde{X}$. Blue resp. red lines refer to a system with $N=10$ resp. $N=100$ servers. The vertical lines denote the heuristic testing time choice given in \ref{eq:design_sigma}.
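
The profile matrix in the caption of Figure 2 can be evaluated directly from the formulas given there. The sketch below does so under one reading of the notation, namely that $P_{x,y}(\sigma)$ is the joint probability that a job of true size $x$ is predicted to have size $y$ after testing for time $\sigma$; this interpretation, the function name `profile_matrix` and all numeric inputs (short-job probability and values at $\sigma=0$) are assumptions for illustration, not values taken from the paper.

```python
import math


def profile_matrix(sigma, p_short, p_mm0, p_MM0):
    """Profile matrix from the Figure 2 caption, as a 2x2 nested list.

    Rows: true size (short x_m, long x_M); columns: predicted size.
    p_short = P(X = x_m); p_mm0, p_MM0 = diagonal entries at sigma = 0.
    The rates 10 and 1 in the exponentials are those stated in the caption.
    """
    p_long = 1.0 - p_short
    p_mm = (1.0 - math.exp(-10.0 * sigma)) * (p_short - p_mm0) + p_mm0
    p_MM = (1.0 - math.exp(-sigma)) * (p_long - p_MM0) + p_MM0
    return [[p_mm, p_short - p_mm],
            [p_long - p_MM, p_MM]]


# Hypothetical inputs: 90% short jobs, imperfect predictions without testing.
for sigma in (0.0, 0.1, 1.0, 5.0):
    print(sigma, profile_matrix(sigma, p_short=0.9, p_mm0=0.81, p_MM0=0.01))
```

With these inputs the diagonal entries increase with $\sigma$ at a decreasing rate, and short jobs are identified faster than long ones because of the larger exponent, which matches the diminishing-returns behavior the caption describes.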

Theorems & Definitions (10)

Remark 1
Remark 2
Proposition 1
Remark 3
Theorem 1
Theorem 2
Proposition 2
Corollary 1
Proposition 3
Theorem 3
