Learning Service Slowdown using Observational Data
Xu Kuang, Gal Mendelson
TL;DR
The paper studies how to learn service slowdowns from observational congestion data in multi-server systems that use adaptive congestion control. It shows that marginal congestion statistics can fail under diffusion-scale balancing and introduces a more robust potential-action statistic, supported by theoretical guarantees. Under maximally stable policies in heavy traffic, a relative-threshold rule detects slowdowns reliably; a central limit theorem gives finite-time confidence, and normal-approximation bounds provide practical reliability estimates. Numerical results illustrate that potential-action signals closely track slowdown magnitudes and outperform marginal statistics, especially at moderate-to-high loads. The work suggests that practitioners monitoring complex, adaptive systems online should combine multiple orthogonal statistics for reliable slowdown detection.
Abstract
Being able to identify service slowdowns is crucial to many operational problems. We study how to use observational congestion data to learn service slowdown in a multi-server system that uses adaptive congestion control mechanisms. We show that a commonly used summary statistic that relies on the marginal congestion measured at individual servers can be highly inaccurate in the presence of adaptive congestion control. We propose a new statistic based on potential routing actions, and show that it provides a much more robust signal for server slowdown in these settings. Unlike the marginal statistic, the potential-action statistic aims to detect changes in the routing actions, and is able to uncover slowdowns even when they are not reflected in marginal congestion. Our results highlight the complexity of performing observational statistical analysis for service systems in the presence of adaptive congestion control. They also suggest that practitioners may want to combine multiple, orthogonal statistics to achieve reliable slowdown detection.
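
To make the contrast between the two kinds of statistic concrete, the following is a minimal sketch and not the authors' definition. It assumes a join-the-shortest-queue routing policy, assumes that the "potential action" at each observation is the server that policy would pick (ties split evenly), and uses a hypothetical relative threshold (`rel_threshold`) that is an illustrative choice rather than the rule analysed in the paper. The snapshot data are made up for illustration.

```python
from typing import List, Sequence

def marginal_means(snapshots: Sequence[Sequence[int]]) -> List[float]:
    """Average queue length observed at each server (the marginal statistic)."""
    n = len(snapshots[0])
    return [sum(s[i] for s in snapshots) / len(snapshots) for i in range(n)]

def potential_action_shares(snapshots: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of snapshots in which each server would be the routing target.

    Assumption: the router joins the shortest queue; ties are split evenly.
    """
    n = len(snapshots[0])
    shares = [0.0] * n
    for s in snapshots:
        shortest = min(s)
        winners = [i for i, q in enumerate(s) if q == shortest]
        for i in winners:
            shares[i] += 1.0 / len(winners)
    return [x / len(snapshots) for x in shares]

def flag_slow(shares: Sequence[float], rel_threshold: float = 0.5) -> List[int]:
    """Flag servers whose share falls below rel_threshold times the equal share 1/n.

    The threshold value is hypothetical, chosen only for this illustration.
    """
    n = len(shares)
    return [i for i, x in enumerate(shares) if x < rel_threshold / n]

# Made-up queue-length snapshots for three servers (rows are observation times).
snapshots = [
    [3, 3, 4],
    [2, 3, 3],
    [4, 3, 4],
    [3, 3, 3],
]

means = marginal_means(snapshots)
shares = potential_action_shares(snapshots)
flagged = flag_slow(shares)
print("marginal means:", means)
print("potential-action shares:", shares)
print("flagged servers:", flagged)
```

In this toy data the adaptive router keeps the queue lengths nearly balanced, so the marginal means differ only slightly, while the third server is almost never the routing target and is the only one flagged.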
