CHASE: A Causal Hypergraph based Framework for Root Cause Analysis in Multimodal Microservice Systems

Ziming Zhao, Zhenwei Wang, Tiehua Zhang, Zhishu Shen, Hai Dong, Zhen Lei, Xingjun Ma, Gaowei Xu, Zhijun Ding, Yun Yang

TL;DR

CHASE is a causal hypergraph framework for root cause analysis (RCA) in multimodal microservice systems. It fuses traces, logs, and metrics into a multimodal invocation graph, applies attentive heterogeneous message passing for instance-level anomaly detection, and then uses hypergraph convolution to capture multivariate causality for root cause localization. On the GAIA and AIOps 2020 datasets it outperforms state-of-the-art baselines, with average gains of up to 36.2% (A@1) and 29.4% (Percentage@1) over the best counterpart. By modeling long-range causality and multimodal information flow in dynamic microservice traces, the work advances end-to-end RCA, with practical implications for reliability engineering.
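
To make the hypergraph convolution step more concrete, here is a minimal sketch of one generic, HGNN-style hypergraph convolution layer in plain Python. It is not CHASE's own causal layer, whose exact formulation is not given in this summary. The function name, argument names, and the normalization D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Θ are assumptions taken from the standard formulation, and the nonlinearity is omitted.

```python
import math


def hypergraph_conv(X, H, edge_w, theta):
    """One generic hypergraph convolution step (illustrative, not CHASE's layer).

    X:      n x d_in node features (list of lists)
    H:      n x m incidence matrix, H[i][e] = 1 if node i belongs to hyperedge e
    edge_w: m hyperedge weights
    theta:  d_in x d_out projection matrix
    """
    n, m = len(H), len(edge_w)
    d_in, d_out = len(theta), len(theta[0])

    # Node degree: weighted number of hyperedges that contain the node.
    dv = [sum(H[i][e] * edge_w[e] for e in range(m)) for i in range(n)]
    # Hyperedge degree: number of nodes inside the hyperedge.
    de = [sum(H[i][e] for i in range(n)) for e in range(m)]
    inv_sqrt_dv = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in dv]

    # Step 1: gather degree-normalized node features into each hyperedge.
    edge_feat = [[0.0] * d_in for _ in range(m)]
    for e in range(m):
        if de[e] == 0:
            continue
        for i in range(n):
            if H[i][e]:
                for k in range(d_in):
                    edge_feat[e][k] += inv_sqrt_dv[i] * X[i][k] / de[e]

    # Step 2: scatter hyperedge features back to member nodes, then project.
    out = []
    for i in range(n):
        agg = [0.0] * d_in
        for e in range(m):
            if H[i][e]:
                for k in range(d_in):
                    agg[k] += inv_sqrt_dv[i] * edge_w[e] * edge_feat[e][k]
        out.append([sum(agg[k] * theta[k][j] for k in range(d_in))
                    for j in range(d_out)])
    return out
```

Because a hyperedge can join any number of nodes, a single step mixes information across every instance that shares a hyperedge, which is what lets this family of layers express multivariate rather than pairwise relations.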

Abstract

In recent years, the widespread adoption of distributed microservice architectures in industry has significantly increased the demand for system availability and robustness. Because service invocation paths and dependencies in enterprise-level microservice systems are complex, it is challenging to locate anomalies promptly during service invocations, which causes intractable issues for normal system operations and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, related information is encoded into representative embeddings and further modeled by a multimodal invocation graph. Following that, anomaly detection is performed on each instance node with attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from the constructed hypergraph, whose hyperedges represent the flow of causality, and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare it with state-of-the-art methods. The results show that CHASE achieves average performance gains of up to 36.2% (A@1) and 29.4% (Percentage@1), respectively, over its best counterpart.
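
The A@1 figure above is a top-k accuracy. As a hedged illustration, the sketch below computes A@k under the definition commonly used in RCA work: the fraction of fault cases whose true root cause appears among the top k ranked candidates. The abstract does not spell out the paper's exact definitions of A@k or Percentage@k, so this definition, the function name, and the instance names in the usage example are all assumptions.

```python
def accuracy_at_k(ranked_lists, true_roots, k):
    """Fraction of fault cases whose true root cause is in the top-k ranking.

    ranked_lists: one ranked candidate list per fault case, most suspicious first
    true_roots:   the ground-truth root cause for each fault case
    k:            how many top candidates to inspect
    """
    if len(ranked_lists) != len(true_roots):
        raise ValueError("need exactly one ranking per fault case")
    if not true_roots:
        raise ValueError("need at least one fault case")
    hits = sum(1 for ranked, truth in zip(ranked_lists, true_roots)
               if truth in ranked[:k])
    return hits / len(true_roots)


# Hypothetical rankings for three fault cases, all caused by instance "db".
rankings = [
    ["db", "web", "cache"],
    ["web", "db", "cache"],
    ["cache", "web", "db"],
]
truths = ["db", "db", "db"]
a_at_1 = accuracy_at_k(rankings, truths, 1)  # one of three cases hit
a_at_2 = accuracy_at_k(rankings, truths, 2)  # two of three cases hit
```

A@1 is the strictest setting, since it only counts cases where the top-ranked candidate is the actual root cause.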

Paper Structure

This paper contains 20 sections, 17 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: An example of service invocations in a microservice system with multimodal data, including traces, monitoring metrics, and logs
  • Figure 2: Overall framework of CHASE
  • Figure 3: Sensitivity analysis: (a) number of attention layers; (b) different positional encodings; (c) hidden dimension; (d) number of causality layers
  • Figure 4: Causal weights visualization