A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends

Tingting Wang, Guilin Qi

TL;DR

This survey addresses root cause analysis in microservice-based cloud environments within the AIOps paradigm. It provides a comprehensive review of RCA methodologies across metrics, traces, logs, and multi-modal data, including graph-based, probabilistic, and deep learning approaches, as well as the emerging role of large language models. It analyzes evaluation criteria such as $MTTR$, Top-$n$ Accuracy, MAR, Precision, Recall, and F1, while emphasizing interpretability and generalizability, and discusses data reliability and graph construction as core challenges. It highlights future directions toward real-time, cross-modal, explainable RCA with automated, scalable graph construction and AI-assisted tooling.

Abstract

The complex dependencies and propagative faults inherent in microservices, characterized by a dense network of interconnected services, pose significant challenges in identifying the underlying causes of issues. Prompt identification and resolution of disruptive problems are crucial to ensure rapid recovery and maintain system stability. Numerous methodologies have emerged to address this challenge, primarily focusing on diagnosing failures through symptomatic data. This survey aims to provide a comprehensive, structured review of root cause analysis (RCA) techniques within microservices, exploring methodologies that include metrics, traces, logs, and multi-modal data. It delves deeper into the methodologies, challenges, and future trends within microservices architectures. Positioned at the forefront of AI and automation advancements, it offers guidance for future research directions.

Paper Structure (20 sections, 4 equations, 6 figures, 7 tables)

This paper contains 20 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Micro-service-framework of service X
  • Figure 2: Structure of this paper.
  • Figure 3: Metrics-traces-logs scope
  • Figure 4: Distributed service request trace
  • Figure 5: A framework of graph-based RCA
  • ...and 1 more figure