Learning and balancing unknown loads in large-scale systems
Diego Goldsztajn, Sem C. Borst, Johan S. H. van Leeuwaarden
TL;DR
The paper tackles load balancing across many server pools under unknown, time-varying demand. It combines a threshold-based inner dispatch rule with outer learning loops that track the offered load, and proves that the threshold reaches an equilibrium on time intervals where the normalized load $\rho(t)$ is suitably bounded, yielding near-perfect balance with an exponentially decaying tail. It introduces a novel methodology, distinct from a traditional fluid-limit analysis and built on strong approximations and relative-compactness arguments, to handle rapid threshold excursions and to analyze two settings: exponential service times with time-varying arrivals, and Coxian service times with a constant arrival rate. The results quantify how the balance improves as the threshold granularity $\Delta$ decreases and provide rigorous guarantees for both the refined learning scheme (exponential case) and the basic scheme (Coxian case), with numerical illustrations supporting the theoretical insights. The work has practical implications for scalable, online load balancing in large-scale streaming and online-gaming platforms where service times are heterogeneous and demand fluctuates over time.
Abstract
Consider a system of identical server pools where tasks with exponentially distributed service times arrive as a time-inhomogeneous Poisson process. An admission threshold is used in an inner control loop to assign incoming tasks to server pools while, in an outer control loop, a learning scheme adjusts this threshold over time to keep it aligned with the unknown offered load of the system. In a many-server regime, we prove that the learning scheme reaches an equilibrium along intervals of time where the normalized offered load per server pool is suitably bounded, and that this results in a balanced distribution of the load. Furthermore, we establish a similar result when tasks with Coxian distributed service times arrive at a constant rate and the threshold is adjusted using only the total number of tasks in the system. The novel proof technique developed in this paper, which differs from a traditional fluid limit analysis, makes it possible to handle rapid variations of the first learning scheme, triggered by excursions of the occupancy process that have vanishing size. Moreover, our approach allows us to characterize the asymptotic behavior of the system with Coxian distributed service times without relying on a fluid limit of a detailed state descriptor.
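To make the two control loops concrete, the following is a minimal simulation sketch and not the scheme analyzed in the paper. The abstract does not specify the dispatch or learning rules, so everything here is an illustrative assumption: the function names `dispatch`, `adjust_threshold` and `simulate`, the fallback to a least-loaded pool, the rule that sets the threshold to the ceiling of the average number of tasks per pool, and the use of a constant arrival rate with unit-rate exponential service.

```python
import math
import random


def dispatch(occupancy, threshold, rng):
    """Inner loop (assumed rule): choose a pool for an incoming task.

    Prefer a uniformly random pool with fewer than `threshold` tasks;
    if no such pool exists, fall back to a uniformly random least-loaded pool.
    """
    below = [i for i, q in enumerate(occupancy) if q < threshold]
    if below:
        return rng.choice(below)
    least = min(occupancy)
    return rng.choice([i for i, q in enumerate(occupancy) if q == least])


def adjust_threshold(total_tasks, n_pools):
    """Outer loop (assumed rule): set the threshold from the total number
    of tasks only, as the ceiling of the average occupancy per pool."""
    return max(1, math.ceil(total_tasks / n_pools))


def simulate(n_pools=20, rho=3.0, horizon=50.0, seed=0):
    """Simulate the continuous-time Markov chain event by event.

    Assumptions: arrivals form a Poisson process of rate rho * n_pools
    and every task has an exponential service time with unit rate.
    """
    rng = random.Random(seed)
    occupancy = [0] * n_pools
    threshold = 1
    t = 0.0
    arrival_rate = rho * n_pools
    while True:
        departure_rate = float(sum(occupancy))
        total_rate = arrival_rate + departure_rate
        t += rng.expovariate(total_rate)
        if t > horizon:
            break
        if rng.random() * total_rate < arrival_rate:
            # Arrival: the inner loop assigns the task to a pool.
            occupancy[dispatch(occupancy, threshold, rng)] += 1
        else:
            # Departure: a uniformly random task leaves, so the pool is
            # picked with probability proportional to its occupancy.
            pool = rng.choices(range(n_pools), weights=occupancy)[0]
            occupancy[pool] -= 1
        # The outer loop re-aligns the threshold after every event.
        threshold = adjust_threshold(sum(occupancy), n_pools)
    return occupancy, threshold


if __name__ == "__main__":
    final_occupancy, final_threshold = simulate()
    print("threshold:", final_threshold)
    print("occupancy:", sorted(final_occupancy))
```

Running the script prints the final threshold and the sorted occupancy vector for one sample path. Varying `rho` gives a rough feel for how the threshold follows the offered load in this toy model, which says nothing about the guarantees proved in the paper.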
