mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture
Wei Zhang, Hongcheng Guo, Jian Yang, Zhoujin Tian, Yi Zhang, Chaoran Yan, Zhoujun Li, Tongliang Li, Xu Shi, Liangfan Zheng, Bo Zhang
TL;DR
mABC tackles root cause analysis (RCA) in complex micro-services by combining seven specialized agents with an Agent Workflow and blockchain-inspired voting to mitigate LLM hallucination and prevent non-terminating loops. The framework orchestrates alert reception, data collection, dependency analysis, probabilistic fault assessment, and solution engineering under a decentralized, transparent governance model. Empirical results on the Train-Ticket benchmark and the AIOps challenge dataset show superior root-cause identification and resolution generation, and the ablation study confirms the essential role of Agent Workflow, multi-agent collaboration, and voting. The approach offers a scalable, automated RCA solution for IT operations, with open-source code and datasets to facilitate adoption and further research.
Abstract
Root cause analysis (RCA) in micro-services architecture (MSA) grows harder as systems become more complex: fault propagation and circular dependencies among nodes make it difficult to maintain system stability and efficiency. Because faults are diverse, diagnosing them requires multiple agents with diverse expertise. To mitigate the hallucination problem of large language models (LLMs), we design blockchain-inspired voting, which ensures the reliability of the analysis through a decentralized decision-making process. To avoid non-terminating loops caused by the circular dependencies common in MSA, we limit the number of steps and standardize task processing through Agent Workflow. We propose a pioneering framework, multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture (mABC), in which multiple LLM-based agents follow Agent Workflow and collaborate through blockchain-inspired voting. Specifically, seven specialized agents derived from Agent Workflow each contribute insights to root cause analysis, drawing on their own expertise and the intrinsic software knowledge of LLMs, and collaborate within a decentralized chain. Our experiments on the AIOps challenge dataset and a newly created Train-Ticket dataset demonstrate superior performance in identifying root causes and generating effective resolutions. The ablation study further shows that Agent Workflow, multi-agent collaboration, and blockchain-inspired voting are crucial for achieving optimal performance. mABC offers comprehensive automated root cause analysis and resolution for micro-services architecture and significantly improves the IT operations domain. The code and dataset are available at https://github.com/zwpride/mABC.
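The two safeguards named in the abstract, a step limit against circular dependencies and a vote among agents, can be illustrated with a minimal sketch. This is not the mABC implementation: the function names `trace_root_cause` and `majority_vote`, the `max_steps` and `threshold` parameters, and the simple-majority rule are all assumptions made for illustration, since the abstract does not specify how steps are bounded or how votes are counted.

```python
from collections import Counter


def trace_root_cause(start, depends_on, is_faulty, max_steps=10):
    """Follow faulty dependencies from `start`, taking at most `max_steps` hops.

    Hypothetical helper: the hop budget guarantees termination even when
    the dependency graph contains a cycle.
    """
    current = start
    path = [current]
    for _ in range(max_steps):
        faulty_deps = [d for d in depends_on.get(current, []) if is_faulty(d)]
        if not faulty_deps:
            # No faulty downstream node: treat the current node as the candidate.
            return current, path
        current = faulty_deps[0]
        path.append(current)
    # Budget exhausted (for example, inside a cycle): return the last node reached.
    return current, path


def majority_vote(votes, threshold=0.5):
    """Return the most common answer if its share exceeds `threshold`, else None.

    Assumed rule: a plain majority over equally weighted votes.
    """
    if not votes:
        return None
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count / len(votes) > threshold else None


# A dependency graph with a cycle between "order" and "payment".
cyclic = {"gateway": ["order"], "order": ["payment"], "payment": ["order"]}
faulty = {"order", "payment"}
node, path = trace_root_cause("gateway", cyclic, faulty.__contains__, max_steps=5)
print(node, len(path))

# Three hypothetical agents vote on the root cause.
print(majority_vote(["payment", "payment", "order"]))
```

Without the hop budget, the walk over `cyclic` would alternate between `order` and `payment` forever. With it, the walk stops after five hops, and the vote then decides whether a proposed answer has enough support to be accepted.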
