Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
Taeyoon Kim, Woohyeok Park, Hoyeong Yun, Kyungyong Lee
TL;DR
This work investigates why LLM-based RCA agents underperform in real-world cloud settings despite the capabilities of modern models. It introduces a process-level analysis and a 12-pitfall taxonomy spanning intra-agent reasoning, inter-agent communication, and agent-environment interaction, and applies it to 1,675 agent runs from five models on the OpenRCA benchmark. Hallucination in Interpretation and Incomplete Exploration dominate across models and stem from the shared architectural design rather than from model quality; prompt engineering alone fails to fix them. Mitigation experiments show that enriching inter-agent communication and implementing robust environment state management yield meaningful improvements in detection accuracy and efficiency, highlighting the need for architectural reforms to achieve reliable autonomous cloud RCA. These insights provide a foundation for building more dependable RCA agents and point to practical directions for future systems, such as verification modules and structured state sharing.
Abstract
Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final-answer correctness without revealing why the agent's reasoning failed. This paper presents a process-level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLMs, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.
