Root Cause Analysis In Microservice Using Neural Granger Causal Discovery

Cheng-Ming Lin, Ching Chang, Wei-Yao Wang, Kuang-Da Wang, Wen-Chih Peng

TL;DR

This work tackles root cause analysis in microservice environments by restoring the temporal context that traditional causal discovery ignores. It introduces RUN, a three-stage framework: self-supervised contrastive learning captures contextual information from time series, neural Granger causal discovery constructs a causal graph, and PageRank with a personalization vector ranks root causes. The approach shows clear gains over baselines on sock-shop data and competitive results on synthetic data, highlighting the importance of temporal dynamics in identifying causal relations. For SREs, this means faster and more accurate pinpointing of root causes in complex, time-evolving microservice systems, with potential to scale to larger deployments.

Abstract

In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintainability, and flexibility. However, when a system malfunctions, the complex relationships among microservices make it challenging for site reliability engineers (SREs) to pinpoint the root cause. Previous research employed structure learning methods (e.g., PC-algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, these methods ignore the temporal order of time series data and fail to leverage the rich information inherent in temporal relationships. For instance, a sudden spike in CPU utilization can increase the latency of other microservices. In this scenario, the anomaly in CPU utilization occurs before the latency increase rather than simultaneously, so the PC-algorithm fails to capture it. To address these challenges, we propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes. Extensive experiments on synthetic and real-world microservice-based datasets demonstrate that RUN noticeably outperforms state-of-the-art root cause analysis methods. Moreover, we provide an analysis scenario for the sock-shop case to showcase the practicality and efficacy of RUN in microservice-based applications. Our code is publicly available at https://github.com/zmlin1998/RUN.
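
The final ranking step mentioned in the abstract, PageRank with a personalization vector over a causal graph, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the metric names, the edges, the anomaly weights, and the choice to reverse edges so the walk moves from effects to causes. How RUN actually builds its personalization vector is not stated in this summary.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iters=100):
    """Power iteration for PageRank with a personalization (restart) vector.

    edges: dict mapping a node to the list of nodes it links to.
    personalization: dict mapping a node to a non-negative restart weight.
    """
    nodes = sorted(set(edges) | {v for targets in edges.values() for v in targets})
    total = sum(personalization.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization must have positive total weight")
    restart = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Restart mass goes to nodes in proportion to their personalization weight.
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            targets = edges.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: hand its mass back to the restart distribution.
                for t in nodes:
                    new_rank[t] += damping * rank[n] * restart[t]
        rank = new_rank
    return rank


# Hypothetical causal edges (cause, effect); the metric names are illustrative.
causal_edges = [
    ("orders_cpu", "orders_latency"),
    ("orders_latency", "frontend_latency"),
    ("catalogue_latency", "frontend_latency"),
]

# Reverse each edge so the walk moves from an effect back to its causes.
walk = {}
for cause, effect in causal_edges:
    walk.setdefault(effect, []).append(cause)

# Made-up anomaly weights used as the restart distribution.
anomaly = {"frontend_latency": 0.5, "orders_latency": 0.3, "orders_cpu": 0.2}

scores = personalized_pagerank(walk, anomaly)
top_k = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_k)
```

In this toy graph the walk concentrates on `orders_cpu`, the upstream metric, even though the largest restart weight sits on the frontend symptom. That is the behaviour a top-k root cause ranking needs.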

Paper Structure (24 sections, 6 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: An example of causal structure discovery-based techniques for RCA.
  • Figure 2: Illustrations of contextual information from the same timestamps but with different contexts. The red window and green window respectively represent two distinct types of contextual information. The same timestamp with different contexts should be close.
  • Figure 3: An illustrated issue of negative pair selection.
  • Figure 4: Overview of our proposed framework, RUN, consisting of three stages: 1) Maximizing the positive pair to capture the contextual information; 2) Neural Granger causal discovery to derive the causal graph from multivariate time series; and 3) The diagnosis stage infers the root cause from the obtained causal graph.
  • Figure 5: Overview of time series forecasting. There are $N$ independent neural networks, one for each time series $i$, used to infer its causal relationships.
  • ...and 3 more figures
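
The per-series forecasting setup described for Figure 5 can be sketched as a Granger-style test: fit one independent forecaster per target series, then check whether removing another series' history makes the forecast worse. The sketch below is not the paper's model. It swaps the neural networks for lag-1 least-squares regressions, uses synthetic data, and applies an arbitrary error-ratio threshold, all of which are assumptions for illustration.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def lag1_mse(series, target, inputs):
    """In-sample MSE of a least-squares fit predicting series[target][t]
    from the lag-1 values of the series in `inputs`, plus an intercept."""
    length = len(series[target])
    rows = [[1.0] + [series[j][t - 1] for j in inputs] for t in range(1, length)]
    ys = [series[target][t] for t in range(1, length)]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * y for r, y in zip(rows, ys)) for p in range(k)]
    w = solve(xtx, xty)
    errors = [(y - sum(wi * ri for wi, ri in zip(w, r))) ** 2 for r, y in zip(rows, ys)]
    return sum(errors) / len(errors)


def granger_graph(series, ratio=1.5):
    """Return edges (j, i) where dropping series j's history raises the
    forecast error for series i by more than `ratio` (an arbitrary threshold)."""
    n = len(series)
    edges = []
    for i in range(n):  # one independent forecaster per target series
        full = lag1_mse(series, i, list(range(n)))
        for j in range(n):
            if j == i:
                continue
            restricted = lag1_mse(series, i, [k for k in range(n) if k != j])
            if restricted > ratio * full:
                edges.append((j, i))
    return edges


# Synthetic data: latency follows CPU with a one-step delay; `other` is unrelated noise.
random.seed(0)
T = 300
cpu = [random.gauss(0, 1) for _ in range(T)]
latency = [0.0] + [0.9 * cpu[t - 1] + random.gauss(0, 0.1) for t in range(1, T)]
other = [random.gauss(0, 1) for _ in range(T)]

edges = granger_graph([cpu, latency, other])
print(edges)
```

Because the test conditions on past values only, the delayed CPU-to-latency dependence shows up as a directed edge from series 0 to series 1. This is the kind of lagged relation that the abstract says a method ignoring temporal order would miss.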