BSODiag: A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems
Tao Duan, Runqing Chen, Pinghui Wang, Junzhou Zhao, Jiongzhou Liu, Shujie Han, Yi Liu, Fan Xu
TL;DR
This work presents BSODiag, an unsupervised, lightweight framework for diagnosing batch servers outage in large-scale cloud infrastructure. It fuses multi-source failure data (alerts, incidents, changes) through a dedicated MFD module, learns failure correlations via historical knowledge graphs, and performs global root-cause localization plus propagation-path inference with ORCA. Key contributions include a three-module architecture, Apriori-based failure correlation mining, and MAPR-based root-cause localization with PPI for interpretable propagation paths. Experiments on Alibaba Cloud data show that BSODiag outperforms baselines in root-cause localization (PR@1/2/3 and MAP) and provides explainable propagation paths, while completing a diagnosis in about 24.5 seconds. In practice, this enables faster, more reliable troubleshooting and reduces outage downtime in large cloud systems.
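To make the Apriori-based failure correlation mining mentioned above more concrete, the following is a minimal sketch of level-wise frequent-itemset mining over failure types that co-occur in the same time window. It is an illustration under stated assumptions, not BSODiag's implementation: the function name `frequent_failure_sets`, the window representation (one set of failure-type labels per time window), and the `min_support` and `max_size` parameters are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_failure_sets(windows, min_support=0.5, max_size=3):
    """Apriori-style mining of failure types that co-occur in a time window.

    windows: list of sets of failure-type labels (hypothetical input format).
    Returns a dict mapping frozenset of labels -> support, where support is
    the fraction of windows that contain every label in the set.
    """
    n = len(windows)
    if n == 0:
        return {}
    windows = [frozenset(w) for w in windows]

    # Level 1: keep single failure types that meet the support threshold.
    counts = Counter(label for w in windows for label in w)
    current = {
        frozenset([label]): c / n
        for label, c in counts.items()
        if c / n >= min_support
    }
    result = dict(current)

    size = 1
    while current and size < max_size:
        size += 1
        labels = sorted(set().union(*current))
        # Apriori pruning: a candidate survives only if every subset one
        # element smaller was frequent at the previous level.
        candidates = [
            frozenset(c)
            for c in combinations(labels, size)
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
        current = {}
        for cand in candidates:
            support = sum(1 for w in windows if cand <= w) / n
            if support >= min_support:
                current[cand] = support
        result.update(current)
    return result
```

Frequent sets found this way could serve as candidate correlations between failure types; how BSODiag turns such correlations into a knowledge graph is not specified in this summary.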
Abstract
Cloud infrastructure is the collective term for all physical devices within cloud systems. Failures within the cloud infrastructure system can severely compromise the stability and availability of cloud services. In particular, batch servers outage, the most severe type of failure, can render all upstream services completely unavailable. In this work, we focus on the batch servers outage diagnosis problem, aiming to analyze the root cause of outages accurately and promptly to facilitate troubleshooting. However, our empirical study conducted in a real industrial system indicates that this is a challenging task. First, the single-modal, coarse-grained failure monitoring data collected in the cloud infrastructure system (i.e., alert, incident, or change) is insufficient for comprehensive failure profiling. Second, due to the intricate dependencies among devices, outages are often the cumulative result of multiple failures, but the correlations between these failures are difficult to ascertain. To address these problems, we propose BSODiag, an unsupervised and lightweight diagnosis framework for batch servers outage. BSODiag provides a global analytical perspective, thoroughly explores failure information from multi-source monitoring data, models the spatio-temporal correlations among failures, and delivers accurate and interpretable diagnostic results. Experiments conducted on the Alibaba Cloud infrastructure system show that BSODiag achieves 87.5% PR@3 and 46.3% PCR, outperforming baseline methods by 10.2% and 3.7%, respectively.
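For readers unfamiliar with the PR@k metric reported above, the sketch below shows one definition that is common in the root-cause localization literature: the number of true root causes found among the top k ranked candidates, divided by min(k, number of true root causes), averaged over outage cases. This is an assumption for illustration; the exact definition used for BSODiag may differ, and the helper names `pr_at_k` and `mean_pr_at_k` are hypothetical. PCR is not covered here because this summary does not define it.

```python
def pr_at_k(ranked, true_causes, k):
    """PR@k for one outage case (assumed definition, see lead-in).

    ranked: candidate root causes, most suspicious first.
    true_causes: the ground-truth root causes for this case.
    """
    truth = set(true_causes)
    if not truth or k <= 0:
        return 0.0
    hits = sum(1 for candidate in ranked[:k] if candidate in truth)
    # Normalize by min(k, |truth|) so a case with fewer than k true
    # root causes can still reach a perfect score.
    return hits / min(k, len(truth))


def mean_pr_at_k(cases, k):
    """Average PR@k over (ranked, true_causes) pairs."""
    if not cases:
        return 0.0
    return sum(pr_at_k(ranked, truth, k) for ranked, truth in cases) / len(cases)
```

Under this definition, a PR@3 of 87.5% would mean that, averaged over cases, 87.5% of the achievable true root causes appear within the top three ranked candidates.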
