Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs

Yifan Wang, Kenneth P. Birman

TL;DR

This paper tackles root-cause analysis for cloud platform instability by introducing ARCA, a multi-modal Retrieval-Augmented Generation (RAG) LLM system that reasons across incident descriptions, logs, and telemetry. ARCA builds a structured knowledge base and uses a two-phase pipeline: a broad, multi-modal triage step based on approximate nearest neighbor (ANN) search, followed by guided LLM-based mitigation generation with human-in-the-loop validation. The experiments report a triage accuracy of ≈92% and a mitigation accuracy of ≈72%, plus effective log clustering on HPC datasets, indicating practical benefits for AI-Ops in dynamic cloud environments. The work highlights the value of cross-modal pattern matching and scalable knowledge bases for real-time incident response, with implications for cost, adaptability, and interpretability in NLU-assisted IT operations.
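To make the triage phase concrete, the following is a minimal sketch of top-k similarity search over an embedded knowledge base. It is not the authors' implementation: the function names, the three-dimensional vectors, and the incident labels are all hypothetical, and an exact brute-force cosine ranking stands in for a real ANN index. The parameter `k` plays the role of the "output size of the similarity search" that the accuracy and cost figures vary.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical knowledge-base entry: (root-cause label, embedding vector).
Entry = Tuple[str, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def triage_top_k(query: Sequence[float], kb: List[Entry], k: int) -> List[str]:
    """Return the labels of the k entries most similar to the query.

    Exact brute-force ranking; a production system would likely swap this
    for an approximate nearest neighbor index.
    """
    scored = [(cosine(query, vec), label) for label, vec in kb]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


def triage_hit(query: Sequence[float], kb: List[Entry], true_label: str, k: int) -> bool:
    """A triage result counts as correct if the true root cause is among the top k."""
    return true_label in triage_top_k(query, kb, k)


# Toy data, invented purely for illustration.
KB: List[Entry] = [
    ("disk-full", [0.9, 0.1, 0.0]),
    ("network-partition", [0.1, 0.9, 0.1]),
    ("memory-leak", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.3, 0.1]

candidates = triage_top_k(query, KB, k=2)
print(candidates)
```

Under this reading, a larger `k` makes it more likely that the true root cause appears among the candidates, but it also hands more material to the later LLM phase, which is one plausible source of the accuracy versus cost trade-off the figures examine.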

Abstract

Today's cloud-hosted applications and services are complex systems, and a performance or functional instability can have dozens or hundreds of potential root causes. Our hypothesis is that by combining the pattern matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface, problem identification and resolution can be simplified. ARCA is a new multi-modal RAG LLM system that targets this domain. Step-wise evaluations show that ARCA outperforms state-of-the-art alternatives.

Paper Structure

This paper contains 20 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: ARCA workflow in its building and query phases.
  • Figure 2: t-SNE (t-distributed stochastic neighbor embedding) of the embedded log content. The x- and y-axes show the coordinates in the t-SNE embedding space.
  • Figure 3: Accuracy of ARCA-PoC. The x-axis represents the output size of the similarity search. The left y-axis shows the triage accuracy, while the right y-axis shows the system accuracy.
  • Figure 4: Cost analysis of using ARCA-PoC. The x-axis represents the output size of the similarity search. The left y-axis shows the average cost of a single query in US cents, while the right y-axis shows its average time consumption.
  • Figure 5: Prompt for LLM to process log file.
  • ...and 3 more figures