BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection

Luan Pham, Huong Ha, Hongyu Zhang

TL;DR

BARO addresses the challenge of robust root cause analysis in microservice systems by integrating Multivariate Bayesian Online Change Point Detection for online anomaly detection with RobustScorer, a nonparametric hypothesis tester, for RCA. It operates without supervision or predefined causal graphs, making it scalable and robust to imprecise anomaly timing. Empirical evaluation on Online Boutique, Sock Shop, and Train Ticket shows BARO consistently outperforms state-of-the-art baselines in both anomaly detection and RCA, with particularly strong performance on large-scale systems and in scenarios with uncertain anomaly times. The approach offers practical impact by delivering fast, unsupervised, and robust RCA suitable for evolving microservice environments, and its components show clear contributions and resilience across datasets and parameter settings.

Abstract

Detecting failures and identifying their root causes promptly and accurately is crucial for ensuring the availability of microservice systems. A typical failure troubleshooting pipeline for microservices consists of two phases: anomaly detection and root cause analysis. While various existing works on root cause analysis require accurate anomaly detection, there is no guarantee of accurate estimation with anomaly detection techniques. Inaccurate anomaly detection results can significantly affect the root cause localization results. To address this challenge, we propose BARO, an end-to-end approach that integrates anomaly detection and root cause analysis for effectively troubleshooting failures in microservice systems. BARO leverages the Multivariate Bayesian Online Change Point Detection technique to model the dependency within multivariate time-series metrics data, enabling it to detect anomalies more accurately. BARO also incorporates a novel nonparametric statistical hypothesis testing technique for robustly identifying root causes, which is less sensitive to the accuracy of anomaly detection compared to existing works. Our comprehensive experiments conducted on three popular benchmark microservice systems demonstrate that BARO consistently outperforms state-of-the-art approaches in both anomaly detection and root cause analysis.
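To make the change point detection idea concrete, here is a minimal sketch of Bayesian Online Change Point Detection. It is a deliberately simplified univariate version, not the multivariate model the paper uses: it assumes Gaussian observations with a known variance, a Gaussian prior on the segment mean, and a constant hazard rate. The function names and all parameter values (`hazard`, `mu0`, `var0`, `obs_var`) are illustrative assumptions, not taken from the paper.

```python
import math


def normal_pdf(x, mean, var):
    """Density of the normal distribution N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def bocpd_map_run_lengths(data, hazard=0.01, mu0=0.0, var0=10.0, obs_var=1.0):
    """Return the most probable run length after each observation.

    Simplifying assumptions (not from the paper): univariate data,
    known observation variance, constant hazard rate.
    """
    run_probs = [1.0]                 # P(run length = 0) before any data
    means, variances = [mu0], [var0]  # posterior over the segment mean, per run length
    map_run_lengths = []
    for x in data:
        # Predictive density of x under each run-length hypothesis.
        pred = [normal_pdf(x, m, v + obs_var) for m, v in zip(means, variances)]
        joint = [p * q for p, q in zip(run_probs, pred)]
        # Either a change point occurs (run length resets) or every run grows by one.
        new_probs = [hazard * sum(joint)] + [(1.0 - hazard) * j for j in joint]
        total = sum(new_probs)
        run_probs = [p / total for p in new_probs]
        # Conjugate Gaussian update of the segment-mean posterior for each grown run;
        # the fresh run (length 0) restarts from the prior.
        new_means, new_vars = [mu0], [var0]
        for m, v in zip(means, variances):
            post_var = 1.0 / (1.0 / v + 1.0 / obs_var)
            new_means.append(post_var * (m / v + x / obs_var))
            new_vars.append(post_var)
        means, variances = new_means, new_vars
        map_run_lengths.append(max(range(len(run_probs)), key=run_probs.__getitem__))
    return map_run_lengths


# Synthetic series: 30 points around 0, then 30 points around 5.
series = [0.1 * (-1) ** t for t in range(30)] + [5.0 + 0.1 * (-1) ** t for t in range(30)]
run_lengths = bocpd_map_run_lengths(series)
# A drop in the most probable run length marks a detected change point.
change_points = [t for t in range(1, len(run_lengths)) if run_lengths[t] < run_lengths[t - 1]]
print(change_points)
```

In this sketch the first detected change point would play the role of the anomaly detection time that is handed to the root cause analysis step.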

Paper Structure (48 sections, 6 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Overview. The monitoring system monitors the microservice system and collects time-series data. BARO consists of two components: Multivariate BOCPD and RobustScorer. Multivariate BOCPD acts as the anomaly detection module and continuously checks for anomalies. When it detects one, it triggers RobustScorer to score and rank the candidate root cause services and metrics.
  • Figure 2: An example of using Multivariate BOCPD to detect change points on multivariate time series data. Dotted vertical red lines indicate change points. We observe that Multivariate BOCPD accurately identifies the anomaly detection time (the first change point), which separates the normal and abnormal periods. The abnormal period contains multiple change points because of the failure propagation chain.
  • Figure 3: The robustness of RobustScorer against imprecise anomaly detection times. In (a), an early anomaly detection time reduces the number of data points used to estimate the distribution of the normal data in the hypothesis test. The median and IQR are more resilient to this limited-data setting than the mean and standard deviation. In (b), a delayed anomaly detection time places abnormal data (outliers) in the normal period. The median and IQR are also more robust to these outliers than the mean and standard deviation.
  • Figure 4: Overview of our setup for microservice systems.
  • Figure 5: The performance of N-Sigma, $\epsilon$-Diagnosis, CIRCA, RCD, and BARO w.r.t. different values of $t_{\text{bias}}$ on the Online Boutique dataset. The figure presents the AC@1, AC@3, and Avg@5 scores from left to right.
  • ...and 1 more figure
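The robustness argument in the Figure 3 caption can be illustrated with a small sketch. It assumes that each metric is scored by its largest deviation in the abnormal period from the median of the normal period, measured in units of the normal period's IQR. That scoring rule, the function name `robust_scores`, the metric names, and the sample values are all assumptions made for illustration, not details confirmed by the text above.

```python
from statistics import median, quantiles


def robust_scores(metrics, t_anomaly, eps=1e-9):
    """Rank metrics by a median/IQR-based deviation score (illustrative only).

    metrics:   dict mapping a metric name to a list of floats
    t_anomaly: index that splits each series into normal and abnormal periods
    """
    scores = {}
    for name, series in metrics.items():
        normal, abnormal = series[:t_anomaly], series[t_anomaly:]
        med = median(normal)
        q1, _, q3 = quantiles(normal, n=4, method="inclusive")
        iqr = max(q3 - q1, eps)  # guard against a zero IQR on constant series
        scores[name] = max(abs(x - med) / iqr for x in abnormal)
    # Highest score first: the most likely root cause candidate leads the list.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical metrics: the failure starts at index 8 and hits cart_cpu hardest.
metrics = {
    "cart_cpu": [10, 11, 10, 12, 11, 10, 11, 12, 40, 42, 41],
    "frontend_latency": [100, 102, 98, 101, 99, 100, 103, 97, 110, 108, 112],
}

exact = robust_scores(metrics, t_anomaly=8)    # precise detection time
delayed = robust_scores(metrics, t_anomaly=9)  # one abnormal point leaks into "normal"
print(exact[0][0], delayed[0][0])
```

With the delayed split, one abnormal sample contaminates the normal period, yet the median and IQR move only slightly, so the ranking is unchanged. A mean and standard deviation computed on the same contaminated window would shift far more.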