MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems

Dewei Liu, Chuan He, Xin Peng, Fan Lin, Chenxi Zhang, Shengfang Gong, Ziang Li, Jiayu Ou, Zheshun Wu

TL;DR

MicroHECL localizes root causes of availability issues in large-scale microservice systems by dynamically constructing a service call graph and analyzing anomaly propagation chains over it. It uses a dedicated detector for each anomaly type: OC-SVM on response time (RT) for performance anomalies, Random Forest on error count (EC) for reliability anomalies, and a 3-sigma rule on QPS, corroborated by correlation checks, for traffic anomalies. A pruning strategy based on edge-trend similarity keeps the analysis tractable. Candidate root causes are ranked by the absolute Pearson correlation between each candidate's quality metric and the initial business-impact metric. In Alibaba deployments, MicroHECL achieves a top-3 hit ratio of $HR@3 = 0.68$ and reduces localization time to $76$ seconds on average.
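
The ranking step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `top_k` parameter, and the service names and numbers in the usage lines are hypothetical, and a constant series is scored as 0 by assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat series carries no evidence
    return cov / sqrt(var_x * var_y)


def rank_candidates(business_metric, candidate_metrics, top_k=3):
    """Rank candidates by |Pearson| between their metric and the business metric."""
    scored = [
        (name, abs(pearson(series, business_metric)))
        for name, series in candidate_metrics.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Illustrative data only: placed orders drop while svc-a's response time spikes.
orders = [100, 98, 60, 40, 45, 90]
candidates = {
    "svc-a": [20, 21, 80, 120, 110, 25],
    "svc-b": [5, 5, 6, 5, 6, 5],
}
ranking = rank_candidates(orders, candidates)
```

Taking the absolute value means a metric that rises while the business metric falls (a strong negative correlation) ranks as high as one that moves in the same direction.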

Abstract

Availability issues of industrial microservice systems (e.g., a drop in successfully placed orders or processed transactions) directly affect the running of the business. These issues are usually caused by various types of service anomalies that propagate along service dependencies. Accurate and highly efficient root cause localization is thus a critical challenge for large-scale industrial microservice systems. Existing approaches use analysis techniques based on service dependency graphs to locate root causes automatically. However, these approaches are limited by their inaccurate detection of service anomalies and their inefficient traversal of the service dependency graph. In this paper, we propose MicroHECL, a highly efficient root cause localization approach for availability issues of microservice systems. Based on a dynamically constructed service call graph, MicroHECL analyzes possible anomaly propagation chains and ranks candidate root causes based on correlation analysis. We combine machine learning and statistical methods and design customized models for the detection of different types of service anomalies (i.e., performance, reliability, traffic). To improve efficiency, we adopt a pruning strategy that eliminates irrelevant service calls during anomaly propagation chain analysis. Experimental studies show that MicroHECL significantly outperforms two state-of-the-art baseline approaches in terms of both accuracy and efficiency. MicroHECL has been deployed at Alibaba, where it achieves a top-3 hit ratio of 68% and reduces root cause localization time from 30 minutes to 5 minutes.
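
To make the idea of propagation chain analysis with pruning concrete, here is a simplified Python sketch. It is an illustration under several assumptions, not the paper's algorithm: the walk only follows calls from caller to callee, anomaly detection is reduced to a precomputed set of anomalous edges, trend similarity is taken to be the Pearson correlation of the metric on successive edges, and the `min_similarity` threshold of 0.7 and all service names are hypothetical.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def find_candidates(calls, edge_series, anomalous_edges, start, min_similarity=0.7):
    """Walk the call graph from `start`; services where a chain ends are candidates."""
    candidates = []
    visited = {start}
    queue = deque([(start, None)])  # (service, edge used to reach it)
    while queue:
        service, prev_edge = queue.popleft()
        extended = False
        for callee in calls.get(service, []):
            edge = (service, callee)
            if edge not in anomalous_edges:
                continue  # no anomaly on this call, so the chain cannot pass through it
            if prev_edge is not None:
                similarity = pearson(edge_series[prev_edge], edge_series[edge])
                if similarity < min_similarity:
                    continue  # pruned: trend differs from the edge we arrived on
            extended = True
            if callee not in visited:
                visited.add(callee)
                queue.append((callee, edge))
        if not extended:
            candidates.append(service)  # chain ends here
    return candidates


# Illustrative call graph and per-edge response-time series (made-up values).
calls = {"gateway": ["order", "search"], "order": ["payment", "inventory"]}
edge_series = {
    ("gateway", "order"): [10, 12, 50, 80, 75],
    ("order", "payment"): [5, 6, 30, 50, 48],
    ("order", "inventory"): [40, 10, 35, 8, 30],
}
anomalous_edges = set(edge_series)
result = find_candidates(calls, edge_series, anomalous_edges, "gateway")
```

In this toy example the `order -> inventory` call is flagged as anomalous, but its trend does not follow the `gateway -> order` edge, so it is pruned and only `payment` remains as a candidate.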

Paper Structure

This paper contains 19 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: MicroHECL Overview
  • Figure 2: Anomaly Propagation Chain Analysis Process
  • Figure 3: Fluctuation of Quality Metrics
  • Figure 4: Detection Time Changes with Num of Nodes
  • Figure 5: Evaluation of the Effect of the Pruning Strategy