Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4

Xuchao Zhang; Supriyo Ghosh; Chetan Bansal; Rujia Wang; Minghua Ma; Yu Kang; Saravan Rajmohan

TL;DR

This paper tackles automated root cause analysis (RCA) for large-scale cloud incidents by proposing an in-context learning (ICL) approach that avoids fine-tuning very large LLMs such as GPT-4. It builds a retrieval-augmented RCA framework in four steps: summarizing incidents, embedding the summaries with a sentence transformer, indexing the embeddings with FAISS, and assembling the top-$k$ retrieved incidents as in-context exemplars to prompt the LLM for root-cause generation. On a dataset of over $101{,}308$ incidents from CompanyX, GPT-4 with ICL outperforms fine-tuned GPT-3 by $\approx$24.8% on average across six metrics and outperforms the zero-shot model by $\approx$49.7%. Human evaluators also report gains over the fine-tuned model in correctness ($+43.5\%$) and readability ($+8.7\%$). The study demonstrates the practicality of using vanilla GPT models for RCA, achieving high accuracy and interpretability without expensive fine-tuning, and highlights the importance of exemplar relevance and prompt design for real-world incident management.
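The retrieval step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: `embed` is a hashed bag-of-words stand-in for the sentence transformer, the brute-force `ExemplarIndex` stands in for a FAISS index, and the names `ExemplarIndex` and `build_prompt`, the prompt wording, and the sample incidents are all hypothetical. The resulting prompt string would then be sent to the LLM, which is left out here.

```python
import math
import re
import zlib
from collections import Counter

DIM = 1024  # size of the toy embedding space (an assumption, not from the paper)


def embed(text, dim=DIM):
    """Toy stand-in for a sentence transformer: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for token, count in Counter(tokens).items():
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class ExemplarIndex:
    """Brute-force inner-product search; a stand-in for a FAISS index."""

    def __init__(self):
        self._entries = []  # list of (embedding, summary, root_cause)

    def add(self, summary, root_cause):
        self._entries.append((embed(summary), summary, root_cause))

    def search(self, query, k=3):
        """Return the k most similar past incidents as (score, summary, root_cause)."""
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, emb)), summary, root_cause)
            for emb, summary, root_cause in self._entries
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


def build_prompt(query_summary, exemplars):
    """Assemble retrieved exemplars and the new incident into one prompt string."""
    parts = ["Past incidents and their root causes:"]
    for i, (_, summary, root_cause) in enumerate(exemplars, start=1):
        parts.append(f"Example {i}\nIncident: {summary}\nRoot cause: {root_cause}")
    parts.append(f"New incident\nIncident: {query_summary}\nRoot cause:")
    return "\n\n".join(parts)


# Hypothetical historical incidents (summary, root cause).
index = ExemplarIndex()
index.add("Disk full on storage node caused write failures",
          "Log rotation disabled, disk filled up")
index.add("TLS certificate expired on gateway, clients see handshake errors",
          "Certificate auto-renewal job failed")
index.add("High latency after deployment due to cache misses",
          "Cache warm-up skipped in rollout")

query = "Clients report TLS handshake errors at the gateway"
top = index.search(query, k=2)
prompt = build_prompt(query, top)
```

In a fuller setup the `embed` function and the list scan would be replaced by a real embedding model and a vector index, while the overall shape (embed, retrieve top-$k$, assemble the prompt) would stay the same.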

Abstract

Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services, requiring on-call engineers to identify the primary issues and implement corrective actions to prevent future recurrences. Improving the incident RCA process is vital for minimizing service downtime, customer impact and manual toil. Recent advances in artificial intelligence have introduced state-of-the-art Large Language Models (LLMs) like GPT-4, which have proven effective in tackling various AIOps problems, ranging from code authoring to incident management. Nonetheless, the GPT-4 model's immense size makes it challenging to fine-tune on user data, because of the significant GPU resource demand and the need to fine-tune the model continuously as new data emerges. To address the high cost of fine-tuning LLMs, we propose an in-context learning approach for automated root causing, which eliminates the need for fine-tuning. We conduct an extensive study on over 100,000 production incidents, comparing several large language models using multiple metrics. The results reveal that our in-context learning approach outperforms previous fine-tuned large language models such as GPT-3 by an average of 24.8% across all metrics, with a 49.7% improvement over the zero-shot model. Moreover, human evaluation involving actual incident owners demonstrates its superiority over the fine-tuned model, achieving a 43.5% improvement in correctness and an 8.7% improvement in readability. These results demonstrate the viability of using a vanilla GPT model for the RCA task, thereby avoiding the high computational and maintenance costs associated with a fine-tuned model.

Paper Structure (34 sections, 9 figures, 6 tables)

Introduction
Background
Incident Root Cause Analysis
The Promise of LLMs
Research Questions
Methodology
Overall Architecture
Data Preparation
Data Collection
Data Cleaning
In-context Example Extraction
Incident Summarization
Retrieval Index Building
In-Context Examples Retrieval
Root Cause Generation
...and 19 more sections

Figures (9)

Figure 1: A sample production incident.
Figure 2: Overview of our In-context Learning RCA Framework
Figure 3: Example of Original and Summarized Incident
Figure 4: Summarization Prompt for Incident Summary and Root Cause
Figure 5: In-context Examples Prompting
...and 4 more figures
