BSODiag: A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems

Tao Duan, Runqing Chen, Pinghui Wang, Junzhou Zhao, Jiongzhou Liu, Shujie Han, Yi Liu, Fan Xu

TL;DR

This work presents BSODiag, an unsupervised, lightweight framework for diagnosing batch servers outage in large-scale cloud infrastructure. It fuses multi-source failure data (alerts, incidents, changes) through a dedicated MFD module, learns failure correlations via historical knowledge graphs, and performs global root-cause localization plus propagation-path inference with ORCA. Key contributions include a three-module architecture, Apriori-based failure correlation mining, and MAPR-based root-cause localization with PPI for interpretable pathways. Experiments on Alibaba Cloud data show that BSODiag outperforms baselines in root-cause localization (PR@1/2/3 and MAP) and provides explainable propagation paths, while completing a diagnosis in about 24.5 seconds. The approach enables faster, more reliable troubleshooting and reduces outage downtime in large cloud systems.
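
To make the idea of Apriori-based failure correlation mining concrete, the following is a minimal sketch of Apriori-style mining restricted to pairs of failure types. It is not the BSODiag implementation: the function name `mine_pair_rules`, the thresholds, and the failure labels are all hypothetical, and each historical outage is assumed to be recorded as a list of the failure types observed with it.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return rules (x, y, support, confidence) meaning 'x tends to co-occur with y'.

    Each transaction is a list of failure labels observed in one historical
    outage (a hypothetical data layout, chosen only for illustration).
    """
    n = len(transactions)
    # Count in how many transactions each single item appears.
    item_counts = Counter(item for t in transactions for item in set(t))
    # Apriori pruning: a pair can only be frequent if both of its items are.
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}

    # Count co-occurrences of the surviving items.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n
        if support < min_support:
            continue
        # Evaluate the rule in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = c / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical outage history; labels are illustrative, not from the paper.
outages = [
    ["power_alert", "switch_down", "server_outage"],
    ["power_alert", "server_outage"],
    ["switch_down", "server_outage"],
    ["config_change"],
]

for x, y, support, confidence in mine_pair_rules(outages, 0.5, 0.8):
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

Rules that pass both thresholds could serve as candidate edges between failure types in a correlation graph; how BSODiag actually builds and weights its knowledge graph is described in the paper itself.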

Abstract

Cloud infrastructure is the collective term for all physical devices within cloud systems. Failures within the cloud infrastructure system can severely compromise the stability and availability of cloud services. Particularly, batch servers outage, which is the most fatal failure, could result in the complete unavailability of all upstream services. In this work, we focus on the batch servers outage diagnosis problem, aiming to accurately and promptly analyze the root cause of outages to facilitate troubleshooting. However, our empirical study conducted in a real industrial system indicates that it is a challenging task. Firstly, the collected single-modal coarse-grained failure monitoring data (i.e., alert, incident, or change) in the cloud infrastructure system is insufficient for a comprehensive failure profiling. Secondly, due to the intricate dependencies among devices, outages are often the cumulative result of multiple failures, but correlations between failures are difficult to ascertain. To address these problems, we propose BSODiag, an unsupervised and lightweight diagnosis framework for batch servers outage. BSODiag provides a global analytical perspective, thoroughly explores failure information from multi-source monitoring data, models the spatio-temporal correlations among failures, and delivers accurate and interpretable diagnostic results. Experiments conducted on the Alibaba Cloud infrastructure system show that BSODiag achieves 87.5% PR@3 and 46.3% PCR, outperforming baseline methods by 10.2% and 3.7%, respectively.

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: The life cycle of a batch servers outage diagnosis.
  • Figure 2: Different failure monitoring data collected in cloud infrastructure system.
  • Figure 3: Analysis of root causes and failure correlations.
  • Figure 4: Analysis of real-world troubleshooting process.
  • Figure 5: The overview of BSODiag.
  • ...and 3 more figures