Risk-aware Adaptive Virtual CPU Oversubscription in Microsoft Cloud via Prototypical Human-in-the-loop Imitation Learning

Lu Wang; Mayukh Das; Fangkai Yang; Junjie Sheng; Bo Qiao; Hang Dong; Si Qin; Victor Rühle; Chetan Bansal; Eli Cortez; Íñigo Goiri; Saravan Rajmohan; Qingwei Lin; Dongmei Zhang


TL;DR

This work tackles risk-aware adaptive vCPU oversubscription in cloud environments by formulating it as a prototypical imitation learning problem. The proposed ProtoHAIL framework learns oversubscription policies by discovering representative usage prototypes, aligning actions with prototype-based references, and integrating active human-in-the-loop (HITL) feedback to refine prototypes and mitigate risk. Key contributions include a novel prototype-based imitation learning approach, an efficient HITL training loop, and extensive evaluations on Microsoft internal cloud data and a semi-synthetic airline overbooking domain, which show reduced risk and increased utilization (e.g., saved cores) across scenarios. The approach offers interpretable, adaptable policies that may generalize to oversubscription problems beyond cloud services.

Abstract

Oversubscription is a prevalent practice in cloud services in which the system offers users or applications more virtual resources, such as virtual cores in virtual machines, than its available physical capacity, in order to reduce the revenue lost to unused or redundant capacity. While oversubscription can significantly improve resource utilization, it carries the risk of overloading physical nodes and introducing jitter if all the co-located virtual machines have high utilization. Thus, suitable oversubscription policies that maximize utilization while mitigating risk are paramount for cost-effective, seamless cloud experiences. Most cloud platforms presently rely on static, heuristics-driven decisions about oversubscription activation and limits, which lead either to overloading or to stranded resources. Designing an intelligent oversubscription policy that adapts to resource utilization patterns and jointly optimizes benefits and risks is largely an unsolved problem. We address this challenge with our proposed novel Human-in-the-loop Prototypical Imitation Learning (ProtoHAIL) framework, which exploits approximate symmetries in utilization patterns to learn suitable policies. In addition, our human-in-the-loop (knowledge-infused) training allows for learning safer policies that are robust to noise and sparsity. Our empirical investigations on real data show orders-of-magnitude reduction in risk and a significant increase in benefits (saving stranded cores) on the Microsoft cloud platform for first-party (internal) services.
