A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends
Tingting Wang, Guilin Qi
TL;DR
This survey addresses root cause analysis (RCA) in microservice-based cloud environments within the AIOps paradigm. It provides a comprehensive review of RCA methodologies across metrics, traces, logs, and multi-modal data, including graph-based, probabilistic, and deep learning approaches, as well as the emerging role of large language models. It analyzes evaluation criteria such as MTTR, Top-$n$ accuracy, MAR, precision, recall, and F1, while emphasizing interpretability and generalizability, and discusses data reliability and graph construction as core challenges. It highlights future directions toward real-time, cross-modal, explainable RCA with automated, scalable graph construction and AI-assisted tooling.
Abstract
The complex dependencies and propagating faults inherent in microservices, which form a dense network of interconnected services, make it difficult to identify the underlying causes of issues. Prompt identification and resolution of disruptive problems are crucial to ensure rapid recovery and maintain system stability. Numerous methodologies have emerged to address this challenge, primarily focusing on diagnosing failures through symptomatic data. This survey provides a comprehensive, structured review of root cause analysis (RCA) techniques for microservices, covering methodologies based on metrics, traces, logs, and multi-modal data. It also examines the challenges and future trends of RCA in microservice architectures. Positioned at the forefront of AI and automation advancements, it offers guidance for future research directions.
