
KGroot: Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks

Tingting Wang, Guilin Qi, Tianxing Wu

TL;DR

KGroot tackles the RCA challenge in complex microservice environments by integrating event-driven knowledge graphs with graph convolutional networks. It defines a Fault Propagation Graph (FPG) to model real-time fault events and a Fault Event Knowledge Graph (FEKG) to encode historical fault patterns, using a Relational Graph Convolutional Network to measure similarity and rank candidate root causes. The approach achieves high accuracy (up to 93.5% top-3) and real-time performance on real and benchmark datasets, outperforming seven baselines and two degraded variants. This work offers a scalable, automated RCA solution suitable for production AIOps, with potential extensions to fault prediction and automated remediation.

Abstract

Fault localization is challenging in online microservice systems because of the volume and variety of monitoring data and events, and the complex interdependencies among services and components. Fault events in services propagate and can trigger a cascade of alerts within a short period. In industry, fault localization is typically performed manually by experienced personnel, an approach that is unreliable and lacks automation. Information barriers between modules during manual localization make it difficult for teams to align quickly during urgent faults. This inefficiency undermines stability assurance, whose goal is to minimize fault detection and repair time. Although methods have been proposed to automate the process, their accuracy and efficiency are less than satisfactory. The precision of fault localization results is of paramount importance because it underpins engineers' trust in the diagnostic conclusions, which are derived from multiple perspectives and offer comprehensive insights. A more reliable method is therefore required to automatically identify the associative relationships among fault events and their propagation paths. To this end, KGroot uses event knowledge and the correlation between events to perform root cause reasoning, integrating knowledge graphs and graph convolutional networks (GCNs) for root cause analysis (RCA). A fault event knowledge graph (FEKG) is built from historical data, an online graph is constructed in real time when a failure event occurs, and GCNs compare the similarity between each knowledge graph and the online graph to pinpoint the fault type through a ranking strategy. Comprehensive experiments demonstrate that KGroot can place the root cause among its top 3 candidate causes with 93.5% accuracy, within seconds. This performance meets the real-time requirements of fault diagnosis in industrial environments and significantly surpasses state-of-the-art RCA baselines in both effectiveness and efficiency.
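
The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learned GCN similarity is replaced here by a plain Jaccard overlap of typed edges, purely so the example runs without a trained model. The event names, the "causes" relation, the fault-type labels, and the helper functions are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# An edge is (source event type, relation, target event type).
# All event and relation names below are made up for illustration.
Edge = Tuple[str, str, str]
Graph = FrozenSet[Edge]


def jaccard_similarity(a: Graph, b: Graph) -> float:
    """Edge-set overlap; a crude stand-in for the learned GCN similarity."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_fault_types(
    online: Graph, fekgs: Dict[str, Graph], k: int = 3
) -> List[Tuple[str, float]]:
    """Score every historical graph against the online graph, return the top k."""
    scored = [(name, jaccard_similarity(online, g)) for name, g in fekgs.items()]
    # Highest similarity first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical historical fault patterns, one graph per known fault type.
fekgs: Dict[str, Graph] = {
    "db_connection_exhausted": frozenset({
        ("db_conn_pool_full", "causes", "api_latency_high"),
        ("api_latency_high", "causes", "gateway_timeout"),
    }),
    "node_cpu_saturation": frozenset({
        ("cpu_usage_high", "causes", "api_latency_high"),
        ("api_latency_high", "causes", "gateway_timeout"),
    }),
    "disk_full": frozenset({
        ("disk_usage_high", "causes", "log_write_failed"),
    }),
}

# Hypothetical online graph built from the alerts of the current incident.
online_graph: Graph = frozenset({
    ("db_conn_pool_full", "causes", "api_latency_high"),
    ("api_latency_high", "causes", "gateway_timeout"),
})

ranking = rank_fault_types(online_graph, fekgs)
for fault_type, score in ranking:
    print(f"{fault_type}: {score:.3f}")
```

A set-overlap measure only rewards exact edge matches. A learned graph model, as used in the paper, can additionally score graphs as similar when their structure and event semantics are close but not identical.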

Paper Structure

This paper contains 13 sections, 3 equations, 4 figures, 5 tables, 2 algorithms.

Figures (4)

  • Figure 1: Workflow of KGroot
  • Figure 2: An example of event transformed to abstract event in alerting
  • Figure 3: An example of transforming the events of fault A to FEKG
  • Figure 4: Graph similarity computation model

Theorems & Definitions (2)

  • Definition 1
  • Definition 2