RCInvestigator: Towards Better Investigation of Anomaly Root Causes in Cloud Computing Systems

Shuhan Liu, Yunfan Zhou, Lu Ying, Yuan Tian, Jue Zhang, Shandan Zhou, Weiwei Cui, Qingwei Lin, Thomas Moscibroda, Haidong Zhang, Di Weng, Yingcai Wu

TL;DR

RCInvestigator is a human-machine visual analytics framework for root cause analysis (RCA) of anomalies in cloud computing. It leverages a knowledge-graph representation of clues and reasoning logic, integrated with time-oriented visualizations and an interactive four-stage workflow (build, monitor, reason, conclude). The approach addresses three problems: modeling complex factor relations, exploring time series at scale, and presenting results comprehensibly. It is validated through two real-world use cases and expert interviews. Findings suggest improved efficiency, interpretability, and knowledge persistence for RCA in cloud environments.

Abstract

Quickly finding the root causes of anomalies in cloud computing systems is crucial to ensuring availability and efficiency, since accurate root causes guide engineers to take appropriate actions to address the anomalies and maintain customer satisfaction. However, it is difficult to investigate and identify root causes from the large-scale, high-dimensional monitoring data collected from complex cloud computing environments. Because cloud computing systems are inherently dynamic, existing approaches in practice rely largely on manual analysis for flexibility and reliability, but the large number of unpredictable factors and the high data complexity make the process time-consuming. Despite recent advances in automated detection and investigation approaches, the speed and quality of root cause analyses remain limited by the lack of expert involvement in these approaches. These limitations motivate us to propose a visual analytics approach that facilitates the interactive investigation of anomaly root causes in cloud computing systems. We identified three challenges: a) modeling databases for root cause investigation, b) inferring root causes from large-scale time series, and c) building comprehensible investigation results. In collaboration with domain experts, we addressed these challenges with RCInvestigator, a novel visual analytics system that establishes a tight collaboration between human and machine and assists experts in investigating the root causes of cloud computing system anomalies. We evaluated the effectiveness of RCInvestigator through two use cases based on real-world data and received positive feedback from experts.

Paper Structure (25 sections, 2 equations, 12 figures)

Figures (12)

  • Figure 1: The existing RCA pipeline includes three stages and relies mainly on manual effort. (A) First, analysts monitor the list of recent alerts and decide which one should be analyzed. (B) Second, analysts write query scripts to retrieve data and e-mail relevant teams for additional information. Then, analysts reason about possible causes and gain insights. They repeat the process until the root cause is identified. (C) Finally, analysts write a summary.
  • Figure 2: The building board has a toolbar and a canvas. Users can create and edit a knowledge graph on the canvas. This is an example: "each cluster belongs to a zone". (A) an entity card (attributes and query), (B) a relation card (semantic and query), and (C) the query template entry.
  • Figure 3: This shows the metaphor mapping of our design. (A) is the investigation board in RCASleuth, while (B) is the investigation board in the real world. There are four types of typical elements, including clues, reasoning logic, annotations, and notes. In RCASleuth, we add the notes from the machine agent.
  • Figure 4: The filter card consists of two parts: (A) users can select filters and options in the selection panel; (B) the preview panel displays alternative groups of options and filtered attributes. For example, the PCP shows Errorcode-TypeError (3$^{rd}$) with OSType-Linux (2$^{nd}$).
  • Figure 5: The first and second steps of Case 1. (A) EA built a knowledge graph with 5 entities and 9 relations based on her domain knowledge, such as "each region contains many zones". (B) EA observed that many incidents were related to Customer80 and happened in Area01-Zone02.
  • ...and 7 more figures