Stability and Optimization of Speculative Queueing Networks

Jonatha Anselmi, Neil Walton

TL;DR

This paper develops a theoretical framework for speculative queueing networks that mitigate stragglers by timeouts and re-routing, distinguishing speculation from traditional replication. It establishes a fluid-model stability condition, showing that the system is positive Harris recurrent whenever the nominal per-queue loads satisfy $\rho_i<1$, and derives criteria under which timeouts expand the stability region. It then designs the optimal timeout via an optimal stopping formulation, providing general conditions and a practical rule that reduces to $\frac{f_1(\tau)}{\bar{F}_1(\tau)} = \frac{1}{\mathbb{E}[\eta_2]}$ under independence; comparisons with replication demonstrate that speculative load balancing often outperforms replication in moderate-to-heavy load regimes, with light loads favoring redundancy. The work also offers a large-system mean-field conjecture for mean response time and discusses practical considerations, including reinforcement-learning-based timeout design and potential hybrid approaches combining speculation with replication for improved throughput and stability.
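
The hazard-rate rule above can be explored numerically. The following is a minimal sketch, not the paper's procedure: it assumes a two-phase hyperexponential first-stage job size (a hypothetical choice whose hazard rate is decreasing), treats $\mathbb{E}[\eta_2]$ as a known constant, and locates the crossing point by bisection. All function names and parameter values are illustrative assumptions.

```python
import math


def hazard_hyperexp(t, p, mu_a, mu_b):
    """Hazard rate f(t) / (1 - F(t)) of a two-phase hyperexponential.

    With probability p the job is exponential with rate mu_a,
    otherwise exponential with rate mu_b (assumed distribution).
    """
    # Factor out the slower decay so the weights never both underflow to 0.
    m = min(mu_a, mu_b)
    a = p * math.exp(-(mu_a - m) * t)
    b = (1.0 - p) * math.exp(-(mu_b - m) * t)
    return (a * mu_a + b * mu_b) / (a + b)


def timeout_from_hazard_rule(hazard, mean_eta2, t_hi=1e3, tol=1e-10):
    """Solve hazard(tau) = 1 / mean_eta2 by bisection.

    Assumes hazard is continuous and decreasing on [0, t_hi].
    Returns 0.0 if the hazard never exceeds the target, and
    math.inf if it never drops below it on [0, t_hi].
    """
    target = 1.0 / mean_eta2
    if hazard(0.0) <= target:
        return 0.0
    if hazard(t_hi) >= target:
        return math.inf
    lo, hi = 0.0, t_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if hazard(mid) > target:
            lo = mid  # completion is still likely enough: keep running
        else:
            hi = mid  # hazard has fallen below the target: time out earlier
    return 0.5 * (lo + hi)


# Hypothetical parameters: 90% fast jobs (rate 10), 10% stragglers (rate 0.1),
# and an assumed mean second-stage work of 1.0.
tau = timeout_from_hazard_rule(
    lambda t: hazard_hyperexp(t, p=0.9, mu_a=10.0, mu_b=0.1),
    mean_eta2=1.0,
)
print(f"timeout from hazard-rate rule: {tau:.4f}")
```

For these assumed parameters the equation can also be solved by hand, giving $\tau = \ln(90)/9.9 \approx 0.4545$, which is a convenient check on the bisection. For a distribution with constant or increasing hazard rate the crossing need not correspond to a minimum, so the sketch only applies under the stated monotonicity assumption.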

Abstract

We provide a queueing-theoretic framework for job replication schemes based on the principle "replicate a job as soon as the system detects it as a straggler". This is called job speculation. Recent works have analyzed replication on arrival, which we refer to as replication. Replication is motivated by its implementation in Google's BigTable. However, systems such as Apache Spark and Hadoop MapReduce implement speculative job execution. The performance and optimization of speculative job execution are not well understood. To this end, we propose a queueing network model for load balancing where each server can speculate on the execution time of a job. Specifically, each job is initially assigned to a single server by a frontend dispatcher. Then, when its execution begins, the server sets a timeout. If the job completes before the timeout, it leaves the network; otherwise, the job is terminated and relaunched or resumed at another server, where it will complete. We provide a necessary and sufficient condition for the stability of speculative queueing networks with heterogeneous servers, general job sizes and scheduling disciplines. We find that speculation can increase the stability region of the network when compared with standard load balancing models and replication schemes. We provide general conditions under which timeouts increase the size of the stability region and derive a formula for the optimal speculation time, i.e., the timeout that minimizes the load induced through speculation. We compare speculation with redundant-$d$ and redundant-to-idle-queue-$d$ rules under an $S\&X$ model. For lightly loaded systems, redundancy schemes provide better response times. However, for moderate to heavy loads, redundancy schemes can lose capacity and have markedly worse response times than a speculative scheme.
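
One way to see where a hazard-rate condition of the form quoted in the TL;DR can come from is the following simplified calculation. It is a sketch under stated assumptions, not the paper's derivation: $S_1$ (an assumed symbol) denotes the job's service time at the first server, with density $f_1$ and tail $\bar{F}_1$; $\eta_2$ is the work required at the second server, taken to be independent of $S_1$; and $W(\tau)$ (also an assumed symbol) is the expected total work per job under timeout $\tau$.

```latex
\begin{align*}
W(\tau) &= \mathbb{E}[\min(S_1,\tau)] + \bar{F}_1(\tau)\,\mathbb{E}[\eta_2]
         = \int_0^{\tau} \bar{F}_1(s)\,ds + \bar{F}_1(\tau)\,\mathbb{E}[\eta_2], \\
W'(\tau) &= \bar{F}_1(\tau) - f_1(\tau)\,\mathbb{E}[\eta_2], \\
W'(\tau) = 0 \;&\Longleftrightarrow\;
\frac{f_1(\tau)}{\bar{F}_1(\tau)} = \frac{1}{\mathbb{E}[\eta_2]}.
\end{align*}
```

Under these assumptions, $W'(\tau) > 0$ exactly when the hazard rate is below $1/\mathbb{E}[\eta_2]$, so the stationary point is a minimum of $W$ only where the hazard rate crosses that level from below, and a maximum where it crosses from above.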

Paper Structure

This paper contains 18 sections, 6 theorems, 54 equations, 4 figures.

Key Result

Proposition 1

Assume that eq:FS1 through eq:FSL hold. If the fluid model is stable, then $X(t)$ is positive Harris recurrent.

Figures (4)

  • Figure 1: Speculative vs. standard load balancing via $L(\tau)$ (eq:LR) under a number of $S\&X$ models.
  • Figure 2: Average response time obtained within Speculative Load Balancing (SLB), Cancel-on-Complete-$d$ (CoC-$d$) and Cancel-on-Start-$d$ (CoS-$d$) under $S\&X$ models. The vertical dashed black lines represent the limits of the stability region of SLB.
  • Figure 3: Average response time obtained within Speculative Load Balancing (SLB) and Redundant-to-Idle-Queue-$d$ (RIQ-$d$) under $S\&X$ models. The vertical dashed black lines represent the limits of the stability region of SLB.
  • Figure 4: Average response times obtained by simulation and via the mean-field conjecture; $S$ has the form given in (S_bimodal) and $N=50$.

Theorems & Definitions (14)

  • Remark 1
  • Definition 1
  • Proposition 1: Dai [dai1995positive]; Bramson [bramson2008stability]
  • Theorem 1
  • Remark 2
  • Proposition 2
  • Theorem 2
  • Remark 3
  • Theorem 3
  • Definition 2: Optimal Timeout
  • ...and 4 more