mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture

Wei Zhang, Hongcheng Guo, Jian Yang, Zhoujin Tian, Yi Zhang, Chaoran Yan, Zhoujun Li, Tongliang Li, Xu Shi, Liangfan Zheng, Bo Zhang

TL;DR

mABC tackles RCA in complex micro-services by combining seven specialized agents with an Agent Workflow and blockchain-inspired voting to mitigate LLM hallucination and prevent non-terminating loops. The framework orchestrates alert reception, data collection, dependency analysis, probabilistic fault assessment, and solution engineering under a decentralized, transparent governance model. Empirical results on the Train-Ticket benchmark and the AIOps dataset show superior root-cause identification and resolution generation, with ablation confirming the essential role of Agent Workflow, multi-agent collaboration, and voting. The approach offers a scalable, automated RCA solution for IT operations, with open-source code and datasets to facilitate adoption and further research.

Abstract

Root cause analysis (RCA) in micro-services architecture (MSA) faces growing challenges in maintaining system stability and efficiency as complexity escalates, owing to fault propagation and circular dependencies among nodes. Diverse faults require multiple agents with diverse expertise. To mitigate the hallucination problem of large language models (LLMs), we design blockchain-inspired voting, which ensures the reliability of the analysis through a decentralized decision-making process. To avoid non-terminating loops caused by the circular dependencies common in MSA, we limit the number of steps and standardize task processing through Agent Workflow. We propose a pioneering framework, multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture (mABC), in which multiple LLM-based agents follow Agent Workflow and collaborate through blockchain-inspired voting. Specifically, seven specialized agents derived from Agent Workflow collaborate within a decentralized chain, each contributing insights towards root cause analysis based on its expertise and the intrinsic software knowledge of LLMs. Our experiments on the AIOps challenge dataset and a newly created Train-Ticket dataset demonstrate superior performance in identifying root causes and generating effective resolutions. The ablation study further shows that Agent Workflow, multi-agent collaboration, and blockchain-inspired voting are all crucial for achieving optimal performance. mABC offers comprehensive automated root cause analysis and resolution in micro-services architecture and significantly improves the IT Operation domain. The code and dataset are available at https://github.com/zwpride/mABC.
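
The voting idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so everything below is an assumption made for illustration: the `Ballot` type, the `tally` function, the three vote options, and the two thresholds are hypothetical names and values, not the authors' implementation. The sketch only shows the general shape of a decentralized check in which a proposed answer is accepted when enough agents take part and enough of those support it.

```python
from dataclasses import dataclass


@dataclass
class Ballot:
    agent: str   # name of the voting agent
    choice: str  # "for", "against", or "abstain" (hypothetical options)


def tally(ballots, support_threshold=0.5, participation_threshold=0.5):
    """Accept a proposal only if participation and support are high enough.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if not ballots:
        return False
    # Agents that abstain do not count towards participation.
    cast = [b for b in ballots if b.choice != "abstain"]
    participation = len(cast) / len(ballots)
    if not cast or participation < participation_threshold:
        return False
    support = sum(b.choice == "for" for b in cast) / len(cast)
    return support > support_threshold


AGENTS = [
    "Alert Receiver", "Process Scheduler", "Data Detective",
    "Dependency Explorer", "Probability Oracle", "Fault Mapper",
    "Solution Engineer",
]

# Four agents support the answer, two oppose it, one abstains.
choices = ["for", "for", "for", "for", "against", "against", "abstain"]
ballots = [Ballot(a, c) for a, c in zip(AGENTS, choices)]
accepted = tally(ballots)
```

Requiring a minimum participation rate in addition to a support rate is one way to keep a small, unrepresentative group of agents from pushing a hallucinated answer through.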

Paper Structure (45 sections, 5 equations, 26 figures, 9 tables)

Figures (26)

  • Figure 1: Example of root cause analysis in MSA. Each node corresponds to a specific service in the system (e.g., login, register). Edge B$\to$I represents that service I relies on the information provided by service B. An alert event arises on node A, while the root cause is node I, with fault propagation path I$\to$G$\to$D$\to$A. The circular dependency H$\to$E$\to$L$\to$H poses an additional challenge.
  • Figure 2: Overview of mABC. The overall pipeline covers the flow from alert inception to root cause analysis within mABC. 1) An alert event arises due to access function blockages or monitoring system alarms in MSA. 2) Alert Receiver ($\mathscr{A}_{1}$) selects the alert event with the highest priority and forwards it. 3) Process Scheduler ($\mathscr{A}_{2}$) divides unfinished root cause analyses into sub-tasks, handled by Data Detective ($\mathscr{A}_{3}$), Dependency Explorer ($\mathscr{A}_{4}$), Probability Oracle ($\mathscr{A}_{5}$), and Fault Mapper ($\mathscr{A}_{6}$) for various requests. 4) Solution Engineer ($\mathscr{A}_{7}$) develops resolutions for the root cause with reference to previous successful cases.
  • Figure 3: Two distinct agent workflows.
  • Figure 4: Vote process on Agent Chain.
  • Figure 5: Determining the priority of an alert on node A based on time, urgency, and impact scope for Alert Receiver ($\mathscr{A}_{1}$).
  • ...and 21 more figures
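
The dependency graph described in the Figure 1 caption can be sketched in a few lines to show why circular dependencies need care. This is only an illustration under stated assumptions: the edge list contains just the edges named in the caption, and `upstream_of` with its `max_steps` bound is a hypothetical helper, not the traversal used by mABC. It walks against the edge direction from a starting node, and both the set of visited nodes and the step bound guarantee termination.

```python
from collections import deque

# Edges named in the Figure 1 caption; ("B", "I") means service I relies on B.
EDGES = [
    ("B", "I"), ("I", "G"), ("G", "D"), ("D", "A"),  # includes path I -> G -> D -> A
    ("H", "E"), ("E", "L"), ("L", "H"),              # circular dependency
]


def upstream_of(start, edges, max_steps=20):
    """Breadth-first walk against edge direction, bounded by max_steps.

    max_steps is a hypothetical safety limit chosen for this sketch.
    """
    providers = {}
    for src, dst in edges:
        providers.setdefault(dst, []).append(src)
    seen = [start]           # visit order, also used to skip repeats
    queue = deque([start])
    steps = 0
    while queue and steps < max_steps:
        node = queue.popleft()
        steps += 1
        for provider in providers.get(node, []):
            if provider not in seen:
                seen.append(provider)
                queue.append(provider)
    return seen


candidates = upstream_of("A", EDGES)   # services that alert node A depends on
cycle_nodes = upstream_of("H", EDGES)  # terminates despite the H/E/L cycle
```

Starting from the alert node A, the walk reaches the root cause node I through D and G. Starting from H, it visits each node of the cycle once and then stops instead of looping.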