Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs
Yifan Wang, Kenneth P. Birman
TL;DR
This paper tackles root-cause analysis for cloud platform instability by introducing ARCA, a multimodal Retrieval-Augmented Generation LLM system that reasons across incident descriptions, logs, and telemetry. ARCA builds a structured knowledge base and employs a two-phase pipeline—a broad, multi-modal triage via ANN search and a guided LLM-based mitigation generation with human-in-the-loop validation. The experimental results demonstrate strong triage (≈92%) and mitigation accuracy (≈72%), plus effective log clustering on HPC datasets, indicating practical benefits for AI-Ops in dynamic cloud environments. The work highlights the value of cross-modal pattern matching and scalable knowledge bases for real-time incident response, with implications for cost, adaptability, and interpretability in NLU-assisted IT operations.
Abstract
Today's cloud-hosted applications and services are complex systems, and a performance or functional instability can have dozens or hundreds of potential root causes. Our hypothesis is that by combining the pattern matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface, problem identification and resolution can be simplified. ARCA is a new multi-modal RAG LLM system that targets this domain. Step-wise evaluations show that ARCA outperforms state-of-the-art alternatives.
