Learning and balancing unknown loads in large-scale systems

Diego Goldsztajn, Sem C. Borst, Johan S. H. van Leeuwaarden

TL;DR

The paper tackles load balancing across many server pools under unknown, time-varying demand. It combines a threshold-based inner dispatch rule with outer learning loops to track the offered load, proving that the threshold equilibrates on time intervals where the normalized load $\rho(t)$ is bounded, yielding near-perfect balance with an exponentially decaying tail. It introduces a novel non-fluid-limit methodology built on strong approximations and relative-compactness arguments to handle rapid threshold excursions and to analyze both time-varying exponential service times and Coxian service times. The results quantify how the balance improves as the threshold granularity $\Delta$ decreases and provide rigorous guarantees for both the refined learning scheme (exponential case) and the basic scheme (Coxian case), with numerical illustrations supporting the theoretical insights. The work has practical implications for scalable, online load balancing in large-scale streaming and online-gaming platforms where service times are heterogeneous and demand fluctuates in time.

Abstract

Consider a system of identical server pools where tasks with exponentially distributed service times arrive as a time-inhomogeneous Poisson process. An admission threshold is used in an inner control loop to assign incoming tasks to server pools while, in an outer control loop, a learning scheme adjusts this threshold over time to keep it aligned with the unknown offered load of the system. In a many-server regime, we prove that the learning scheme reaches an equilibrium along intervals of time where the normalized offered load per server pool is suitably bounded, and that this results in a balanced distribution of the load. Furthermore, we establish a similar result when tasks with Coxian distributed service times arrive at a constant rate and the threshold is adjusted using only the total number of tasks in the system. The novel proof technique developed in this paper, which differs from a traditional fluid limit analysis, makes it possible to handle rapid variations of the first learning scheme, triggered by excursions of the occupancy process that have vanishing size. Moreover, our approach allows us to characterize the asymptotic behavior of the system with Coxian distributed service times without relying on a fluid limit of a detailed state descriptor.
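To make the two control loops concrete, the following is a minimal simulation sketch, not the scheme analyzed in the paper. It assumes a single threshold, a constant arrival rate `lam`, exponential service times with rate `mu`, and an outer loop that uses only the total number of tasks, as the abstract describes for the basic scheme. The function names (`dispatch`, `update_threshold`, `simulate`), the tie-breaking rules and the exact update rule are hypothetical choices made for illustration.

```python
import random


def dispatch(pools, threshold, rng):
    """Inner loop (assumed rule): send the task to a uniformly chosen pool
    with fewer than `threshold` tasks; if none exists, pick any pool."""
    below = [i for i, tasks in enumerate(pools) if tasks < threshold]
    i = rng.choice(below) if below else rng.randrange(len(pools))
    pools[i] += 1
    return i


def update_threshold(threshold, total_tasks, n_pools, delta):
    """Outer loop (assumed rule): move the threshold in steps of `delta`
    until threshold - delta <= tasks per pool < threshold."""
    load = total_tasks / n_pools
    while threshold <= load:
        threshold += delta
    while threshold - delta > load:
        threshold -= delta
    return threshold


def simulate(n_pools=50, lam=150.0, mu=1.0, delta=1, horizon=20.0, seed=0):
    """Simulate arrivals and departures up to time `horizon`, starting empty."""
    rng = random.Random(seed)
    pools = [0] * n_pools
    threshold, t = delta, 0.0
    while True:
        total = sum(pools)
        rate = lam + mu * total  # total event rate in the current state
        t += rng.expovariate(rate)
        if t > horizon:
            return pools, threshold
        if rng.random() < lam / rate:
            dispatch(pools, threshold, rng)  # arrival
        else:
            # Departure: every task in the system is equally likely to finish.
            i = rng.choices(range(n_pools), weights=pools)[0]
            pools[i] -= 1
        threshold = update_threshold(threshold, sum(pools), n_pools, delta)
```

Calling `simulate()` returns the final occupancy of each pool together with the final threshold. Under these assumptions the threshold settles just above the number of tasks per pool, and most pools end up close to that level; a coarser `delta` leaves more room between the threshold and the load.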

Paper Structure

This paper contains 38 sections, 21 theorems, 131 equations, 3 figures.

Key Result

Theorem 2

Suppose that there exists $\gamma_0 \in (0, 1/ 2)$ such that If the offered load is $(m, \Delta)$-bounded on an interval $[a, b]$ and $\sigma(a, b, m, \Delta) < \sigma < b - a$, then there exist $c > 0$ and a set of probability one where the following statements hold: The constant $c$ can be taken equal to $u(a + \sigma)$.

Figures (3)

  • Figure 1: Schematic representation of the occupancy state of the system for exponentially distributed service times. White circles represent servers and black circles represent tasks. Each row corresponds to a server pool, and these are arranged so that the number of tasks increases from top to bottom. The number of tasks in column $i$ is $nq_n(i)$.
  • Figure 2: Numerical experiments with two different choices of $\Delta$. The evolution of the thresholds is plotted in (a), the occupancy state of the system with $\Delta = 3$ along the $3$-bounded interval $[3, 12]$ is depicted in (b) and the occupancy state of the system with $\Delta = 1$ along the $1$-bounded interval $[14, 23]$ is plotted in (c). Both experiments correspond to initially empty systems with $\mu = 1$, $n = 300$, $\alpha_n = 1 - 1 / n^{0.48}$ and the arrival rate function plotted in (a); the superscripts in the legends indicate the value of $\Delta$.
  • Figure 3: Schematic representation of the relations between the various results used to prove the main theorem for the exponential case; some intermediate results are omitted. An arrow connecting two results means that the first result is used to prove the second. The dotted boxes indicate the sections where the results are proven.

Theorems & Definitions (44)

  • Remark 1
  • Remark 2
  • Definition 1
  • Theorem 2
  • Remark 3
  • Theorem 3
  • Proposition 4
  • Lemma 5
  • proof
  • Lemma 6
  • ...and 34 more