X-lifecycle Learning for Cloud Incident Management using LLMs

Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, Saravan Rajmohan

TL;DR

The paper addresses cloud incident management challenges where crucial context is scattered across stages of the SDLC. It introduces X-lifecycle prompt augmentation to LLMs, incorporating upstream dependencies and service properties to improve root-cause analysis and monitor categorization. Using Microsoft IC3 data (353 incidents, 260 monitors), the approach yields significant gains over state-of-the-art methods, with human evaluation validating improved readability and relevance. The work demonstrates the practical value of cross-lifecycle data in helping on-call engineers resolve incidents more quickly and reliably, and it outlines directions for broader adoption and future extensions to other failure types.

Abstract

Incident management for large cloud services is a complex and tedious process that requires a significant amount of manual effort from on-call engineers (OCEs). OCEs typically leverage data from different stages of the software development lifecycle [SDLC] (e.g., code, configuration, monitor data, service properties, service dependencies, troubleshooting documents) to generate insights for detecting, root-causing, and mitigating incidents. Recent advancements in large language models [LLMs] (e.g., ChatGPT, GPT-4, Gemini) have created opportunities to automatically generate contextual recommendations for OCEs, helping them quickly identify and mitigate critical issues. However, existing research typically takes a siloed view, solving a given incident management task by leveraging data from a single stage of the SDLC. In this paper, we demonstrate that augmenting additional contextual data from different stages of the SDLC improves performance on two critically important and practically challenging tasks: (1) automatically generating root cause recommendations for dependency-failure-related incidents, and (2) identifying the ontology of service monitors used for automatically detecting incidents. Using a dataset of 353 incidents and 260 monitors from Microsoft, we demonstrate that augmenting contextual information from different stages of the SDLC improves performance over state-of-the-art methods.
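To make the idea of cross-lifecycle prompt augmentation concrete, the following is a minimal sketch of how incident text, in-context examples, and upstream dependency descriptions might be combined into one prompt. It is not the paper's actual prompt or code: the `Incident` class, the `build_prompt` function, the field names, and the prompt wording are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    title: str
    summary: str
    root_cause: str = ""  # filled in only for past incidents used as examples


def build_prompt(incident, examples, upstream_dependencies):
    """Assemble a root-cause prompt from incident text, past examples,
    and upstream dependency descriptions (name -> short description)."""
    parts = ["Suggest a likely root cause for the current incident."]

    # Cross-lifecycle context: services this one depends on.
    if upstream_dependencies:
        parts.append("Upstream service dependencies:")
        for name, description in upstream_dependencies.items():
            parts.append(f"- {name}: {description}")

    # In-context examples: past incidents with known root causes.
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f"Title: {ex.title}")
        parts.append(f"Summary: {ex.summary}")
        parts.append(f"Root cause: {ex.root_cause}")

    # The incident under investigation; the model completes the last line.
    parts.append("Current incident:")
    parts.append(f"Title: {incident.title}")
    parts.append(f"Summary: {incident.summary}")
    parts.append("Root cause:")
    return "\n".join(parts)
```

Passing an empty dependency mapping yields a prompt with incident information and examples only, which is one simple way to compare a prompt with and without the additional lifecycle context.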

Paper Structure (33 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Incident management lifecycle.
  • Figure 2: A sample monitor snapshot.
  • Figure 3: A sample dependency failure incident.
  • Figure 4: Prompt to summarize the Service Descriptions.
  • Figure 5: Prompt with Incident Information, In-context Examples, and Upstream Service Dependency Details (InC DEP).
  • ...and 2 more figures