Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure

Chaoyun Zhang, Randolph Yao, Si Qin, Ze Li, Shekhar Agrawal, Binit R. Mishra, Tri Tran, Minghua Ma, Qingwei Lin, Murali Chintalapati, Dongmei Zhang

TL;DR

Deoxys is a causal inference engine that recommends mitigation actions for unhealthy nodes in cloud systems, minimizing virtual machine downtime and interruptions during unhealthy events. This enhances the reliability and stability of cloud platforms, resulting in a seamless customer experience.

Abstract

The presence of unhealthy nodes in cloud infrastructure signals the potential failure of machines, which can significantly impact the availability and reliability of cloud services, resulting in negative customer experiences. Effective mitigation of unhealthy nodes is therefore vital for sustaining cloud system performance. This paper introduces Deoxys, a causal inference engine that recommends mitigation actions for unhealthy nodes in cloud systems to minimize virtual machine (VM) downtime and interruptions during unhealthy events. It employs double machine learning (DML) combined with a causal forest to produce precise and reliable mitigation recommendations based solely on limited observational data collected from historical unhealthy events. To enhance the causal inference model, Deoxys further incorporates a policy fallback mechanism based on model uncertainty, together with action overriding mechanisms, to (i) improve the reliability of the system, and (ii) strike a good tradeoff between downtime reduction and resource utilization, thereby enhancing overall system performance. After deploying Deoxys in a large-scale cloud infrastructure at Microsoft, we observe that it reduces average VM downtime by 53% compared to a legacy policy, while lowering the VM interruption rate by 49.5%. This substantial improvement enhances the reliability and stability of cloud platforms, resulting in a seamless customer experience.
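The abstract mentions a policy fallback mechanism driven by model uncertainty. The following is a minimal sketch of that idea, not the paper's implementation: the action names, the use of a standard error as the uncertainty measure, the threshold `max_std_err`, and the rule of deviating from the legacy action only when a reduction is predicted are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EffectEstimate:
    """Hypothetical container: estimated change in VM downtime (seconds)
    of an action relative to the legacy action, plus its standard error.
    Negative effect means less downtime than the legacy action."""
    effect: float
    std_err: float


def recommend(estimates: Dict[str, EffectEstimate],
              legacy_action: str,
              max_std_err: float) -> str:
    """Pick the action with the lowest estimated downtime, but fall back
    to the legacy action when no estimate is certain enough (assumed rule)."""
    # Keep only actions whose estimate is below the uncertainty threshold.
    confident = {a: e for a, e in estimates.items() if e.std_err <= max_std_err}
    if not confident:
        return legacy_action
    best = min(confident, key=lambda a: confident[a].effect)
    # Deviate from the legacy policy only if a reduction is predicted.
    if confident[best].effect >= 0:
        return legacy_action
    return best


# Illustrative numbers only; they are not taken from the paper.
estimates = {
    "live_migrate": EffectEstimate(effect=-120.0, std_err=15.0),
    "reboot_node": EffectEstimate(effect=-300.0, std_err=200.0),
}
choice = recommend(estimates, legacy_action="wait", max_std_err=50.0)
```

Here `reboot_node` has the larger estimated reduction but is discarded because its estimate is too uncertain, so the sketch selects `live_migrate`. With a much stricter threshold, neither estimate qualifies and the legacy action is returned.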

Paper Structure

This paper contains 39 sections, 8 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: The workflows of unhealthy events and mitigation.
  • Figure 2: The causal graph (a) and the manipulated causal graph (b) of the unhealthy node mitigation.
  • Figure 3: The two-stage DML framework.
  • Figure 4: The overall Deoxys system architecture.
  • Figure 5: The dynamic causal graph of an unhealthy event.
  • ...and 5 more figures
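
The two-stage DML framework named in Figure 3 can be illustrated with a toy residual-on-residual estimator. This is a sketch under strong simplifying assumptions, not the paper's pipeline: it uses a single confounder, a continuous treatment, linear nuisance models, and a constant treatment effect in the final stage, whereas the paper pairs DML with a causal forest. All function and variable names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def dml_effect(x, t, y, n_folds=2):
    """Stage 1: cross-fitted nuisance models predict treatment t and
    outcome y from confounder x. Stage 2: regress the outcome residuals
    on the treatment residuals to estimate a constant effect."""
    n = len(x)
    t_res = [0.0] * n
    y_res = [0.0] * n
    for k in range(n_folds):
        test = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        # Nuisance models are fitted on the training fold only.
        at, bt = fit_line([x[i] for i in train], [t[i] for i in train])
        ay, by = fit_line([x[i] for i in train], [y[i] for i in train])
        # Residuals are computed on the held-out fold.
        for i in test:
            t_res[i] = t[i] - (at + bt * x[i])
            y_res[i] = y[i] - (ay + by * x[i])
    num = sum(tr * yr for tr, yr in zip(t_res, y_res))
    den = sum(tr * tr for tr in t_res)
    return num / den


# Synthetic data: x confounds both t and y; the true effect of t on y is 2.0.
rng = random.Random(0)
x = [rng.uniform(0.0, 1.0) for _ in range(200)]
t = [xi + rng.gauss(0.0, 0.5) for xi in x]
y = [2.0 * ti + 3.0 * xi for ti, xi in zip(t, x)]
theta = dml_effect(x, t, y)
```

Because the synthetic outcome is noise-free and exactly linear, `theta` recovers the true effect of 2.0 up to floating-point error. A naive regression of `y` on `t` alone would be biased by the confounder `x`.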