COIN: Chance-Constrained Imitation Learning for Uncertainty-aware Adaptive Resource Oversubscription Policy
Lu Wang, Mayukh Das, Fangkai Yang, Chao Duo, Bo Qiao, Hang Dong, Si Qin, Chetan Bansal, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
TL;DR
COIN tackles uncertainty in resource usage telemetry by using chance-constrained imitation learning to learn oversubscription policies that balance resource efficiency against congestion risk. It transforms the stochastic constraint into a deterministic form under Gaussian assumptions, uses a backward value function to estimate whether the constraint is satisfied, and employs ensemble value learning to capture the variance of cost values. A safety layer projects actions so that they satisfy the constraint, while the policy itself is trained with an imitation loss. Experiments across cloud and airline domains show approximately 3-4× improvements in efficiency and safety over baselines. The resulting policies are learned offline, are robust to uncertainty, and are practical for real systems, enabling adaptive oversubscription with probabilistic safety guarantees.
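To make the Gaussian step concrete, the following is a minimal sketch, not the authors' implementation. It assumes the cost values predicted by an ensemble are approximately Gaussian, so the chance constraint P(cost <= threshold) >= 1 - delta can be checked through the deterministic inequality mean + z * std <= threshold, where z is the (1 - delta) quantile of the standard normal. The function names, the ensemble outputs, the threshold, and delta are all hypothetical values chosen for illustration.

```python
from statistics import NormalDist, mean, pstdev


def deterministic_cost_bound(cost_estimates, delta):
    """Upper bound on cost that holds with probability 1 - delta,
    assuming the cost is Gaussian with the ensemble's mean and std."""
    mu = mean(cost_estimates)            # ensemble mean of the cost value
    sigma = pstdev(cost_estimates)       # ensemble spread as a variance proxy
    z = NormalDist().inv_cdf(1.0 - delta)  # standard normal quantile
    return mu + z * sigma


def chance_constraint_satisfied(cost_estimates, threshold, delta):
    """Deterministic surrogate for P(cost <= threshold) >= 1 - delta."""
    return deterministic_cost_bound(cost_estimates, delta) <= threshold


# Hypothetical cost predictions from a five-member ensemble.
ensemble_costs = [0.10, 0.12, 0.08, 0.11, 0.09]

loose = chance_constraint_satisfied(ensemble_costs, threshold=0.15, delta=0.05)
tight = chance_constraint_satisfied(ensemble_costs, threshold=0.12, delta=0.05)
```

With these illustrative numbers the looser threshold passes the check and the tighter one fails, because the ensemble spread pushes the bound above the mean. In a scheme of this kind, a safety layer would adjust a proposed action whenever the check fails.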
Abstract
We address the challenge of learning safe and robust decision policies in the presence of uncertainty, in the context of the real scientific problem of adaptive resource oversubscription, where the goal is to enhance resource efficiency while ensuring safety against resource congestion risk. Traditional supervised prediction or forecasting models are ineffective at learning adaptive policies, whereas standard online optimization or reinforcement learning is difficult to deploy on real systems. Offline methods such as imitation learning (IL) are ideal, since they can directly leverage historical resource usage telemetry. However, the underlying aleatoric uncertainty in such telemetry is a critical bottleneck. We solve this with our proposed chance-constrained imitation learning framework, which ensures implicit safety against uncertainty in a principled manner via a combination of stochastic (chance) constraints on resource congestion risk and ensemble value functions. This leads to substantial ($\approx 3-4\times$) improvement in resource efficiency and safety in many oversubscription scenarios, including resource management in cloud services.
