RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models

Zefan Wang; Zichuan Liu; Yingying Zhang; Aoxiao Zhong; Jihong Wang; Fengbin Yin; Lunting Fan; Lingfei Wu; Qingsong Wen

TL;DR

This work tackles cloud root cause analysis (RCA) under strict privacy constraints by deploying RCAgent, an agent built on an internally hosted LLM that autonomously gathers data, analyzes it with expert tools, and interacts with the environment. The framework comprises the Observation Snapshot Key (OBSK), JSON-based tool interfaces, code and log expert agents, stabilization mechanisms, and trajectory-level Self-Consistency, which together produce root-cause, solution, evidence, and responsibility predictions. In offline and online evaluations on Alibaba Cloud's Real-time Compute Platform for Apache Flink, RCAgent outperforms ReAct on all RCA facets, with substantial gains in METEOR, BLEURT, and BARTScore, and remains stable under noisy data and out-of-distribution (OoD) conditions; its deployment also demonstrates scalability in production. The approach emphasizes privacy-preserving, tool-augmented autonomy for industrial RCA, and ablations show that the LLM experts, JsonRegen, OBSK, and Self-Consistency are each necessary to these results.

Abstract

Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently. However, current methods are still reliant on manual workflow settings and do not unleash LLMs' decision-making and environment interaction capabilities. We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage. Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools. Our framework combines a variety of enhancements, including a unique Self-Consistency for action trajectories, and a suite of methods for context management, stabilization, and importing domain knowledge. Our experiments show RCAgent's evident and consistent superiority over ReAct across all aspects of RCA -- predicting root causes, solutions, evidence, and responsibilities -- and tasks covered or uncovered by current rules, as validated by both automated metrics and human evaluations. Furthermore, RCAgent has already been integrated into the diagnosis and issue discovery workflow of the Real-time Compute Platform for Apache Flink of Alibaba Cloud.

Paper Structure (42 sections, 5 figures, 5 tables, 1 algorithm)

Introduction
Challenge
Privacy
Context Length
Action Validity
Methodology
Observation Snapshot Key
Tool Preparation
Information-gathering Tools
Analytical Tools
Code analysis tool.
Log analysis tool.
Stabilization
JSON Repairing
Error Handling
...and 27 more sections

Figures (5)

Figure 1: Overview of the action cycles in RCAgent. Each cycle involves generating verbal thoughts, taking actions, and receiving observations from the environment, all of which are recorded in the prompt alongside the initial memory to boost reasoning. RCAgent also includes a key-value store for observation retrieval, allowing the agent to operate on lengthy text data. After parsing an action, RCAgent executes it directly or invokes an expert agent, depending on the action's type.
Figure 2: Code analysis tool in RCAgent.
Figure 3: Trajectory-level Self-Consistency. Every Step in RCAgent means a sequential procedure of thought, action, and observation.
Figure 4: Performance of Self-Consistency at different scales and methods. The solid line is the mean score, and the shade represents the standard deviation. The score is calculated on the concatenated solution and root cause.
Figure 5: Performance and resource consumption at different data scales.
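
The key-value store for observation retrieval described in the Figure 1 caption can be illustrated with a minimal sketch. This is only one plausible reading of the idea, not the paper's implementation: the class name `ObservationStore`, the hash-based key scheme, the 8-character key length, and the preview format are all assumptions made for illustration.

```python
import hashlib


class ObservationStore:
    """Hypothetical store for long observations (an assumption, not RCAgent's code)."""

    def __init__(self, preview_chars=80):
        self.preview_chars = preview_chars
        self._store = {}

    def put(self, observation):
        # Assumed key scheme: a short, deterministic digest of the content.
        key = hashlib.sha256(observation.encode("utf-8")).hexdigest()[:8]
        self._store[key] = observation
        return key

    def snapshot(self, observation):
        # Text that would go into the prompt: the key plus a truncated head,
        # so the prompt stays short while the full text remains retrievable.
        key = self.put(observation)
        head = observation[: self.preview_chars]
        return f"[key={key}] {head}"

    def get(self, key):
        # A later tool call can fetch the full observation by its key.
        return self._store[key]
```

In this sketch, a long log would be passed to `snapshot`, only the returned short string would be appended to the prompt, and an analysis tool would call `get` with the key when it needs the full text.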
