AI Assistants for Incident Lifecycle in a Microservice Environment: A Systematic Literature Review

Dahlia Ziqi Zhou, Marios Fokaefs

TL;DR

Incidents in microservice environments are costly and complex, driven by distributed interactions and large volumes of observability data. The authors conduct a SEGRESS-guided systematic literature review to map AI assistants across the incident lifecycle, data types, and methods, based on 31 primary studies (mostly from 2023–2024). The findings show a strong focus on detection and containment using LLMs and deep learning, with logs and traces as key data sources, while the Prepare and Post-Incident phases remain underexplored. The work highlights opportunities to broaden data sources, include more user-centered evaluations, and develop proactive incident-management tools that span the entire lifecycle in microservice ecosystems.

Abstract

Incidents in microservice environments can be costly and challenging to recover from due to their complexity and distributed nature. Recent advancements in artificial intelligence (AI) offer promising solutions for improving incident management. This paper systematically reviews primary studies on AI assistants designed to support different phases of the incident lifecycle. It highlights successful applications of AI, identifies gaps in current research, and suggests future opportunities for enhancing incident management through AI. By examining these studies, the paper aims to provide insights into the effectiveness of AI tools and their potential to address ongoing challenges in incident recovery.

Paper Structure (21 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: A Sankey Diagram of the Study Selection Process
  • Figure 2: The Publication Years of Selected Primary Studies