Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments

Yuhan Zhu, Jian Wang, Bing Li, Xuxian Tang, Hao Li, Neng Zhang, Yuqi Zhao

TL;DR

Root-cause localization in cloud-edge microservice systems is challenged by cross-segment latency, network instability, and dynamic hybrid deployments. The authors propose MicroCERCL, which combines kernel-level log parsing, anomaly detection, and a graph neural network operating over a heterogeneous dynamic topology stack to infer root-cause probabilities without historical failure data. The approach demonstrates substantial accuracy gains over baselines across three hybrid benchmarks and shows robustness to noise, with a practical runtime. The work provides an open-source benchmark and code to facilitate replication and future research in cloud-edge failure analysis.

Abstract

With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. The cloud-edge collaborative environment introduces further difficulties, such as unstable networks and high latency across network segments. Accurately identifying the root cause of microservice failures in a cloud-edge collaborative environment has thus become an urgent problem. In this paper, we propose MicroCERCL, a novel approach that pinpoints root causes at the kernel and application levels in the cloud-edge collaborative environment. Our key insight is that failures propagate through direct invocations and indirect resource-competition dependencies in a cloud-edge collaborative environment characterized by instability and high latency. This propagation becomes more complex in a hybrid deployment that involves multiple microservice systems simultaneously. Leveraging this insight, we extract valid content from kernel-level logs to prioritize localizing the kernel-level root cause. Moreover, we construct a heterogeneous dynamic topology stack and train a graph neural network model to accurately localize the application-level root cause without relying on historical data. Notably, we release the first hybrid-deployment microservice benchmark system in a cloud-edge collaborative environment (the largest and most complex to our knowledge). Experiments conducted on the dataset collected from the benchmark show that MicroCERCL can accurately localize the root cause of microservice systems in such environments, significantly outperforming state-of-the-art approaches with an increase of at least 24.1% in top-1 accuracy.

Paper Structure

This paper contains 31 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Root cause localization in the hybrid-deployed cloud-edge collaborative environment.
  • Figure 2: Influence of latency over different deployment topologies.
  • Figure 3: Three topologies of service deployment in the cloud-edge collaborative environment.
  • Figure 4: An example of the hybrid deployment with direct and indirect dependencies.
  • Figure 5: Overall framework of MicroCERCL.
  • ...and 4 more figures