AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review

Youcef Remil, Anes Bendimerad, Romain Mathonat, Mehdi Kaytoue

TL;DR

This survey addresses the challenge of managing incidents in increasingly complex IT environments by proposing a unified AIOps-focused framework for incident management. It introduces six core AIOps abilities (Perception, Prevention, Detection, Location, Action, Interaction) and a structured, multi-layer incident-management workflow, complemented by a new taxonomy and data-management guidelines. The paper surveys a wide range of data-driven techniques across detection, prediction, prioritization, assignment, classification, deduplication, root cause analysis (RCA), correlation, and mitigation, and catalogues publicly available datasets and benchmarks to enable reproducibility. By organizing contributions around data sources, types, tasks, and evaluation metrics, it identifies gaps (e.g., in assignment and deduplication) and highlights open challenges, including interpretability, trust, scalability, and the need for industry-academia collaboration and open-source data and models for real-world impact.

Abstract

The management of modern IT systems poses unique challenges, necessitating scalability, reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on manual tasks and rule-based approaches, prove inefficient for the substantial data volumes and alerts generated by IT systems. Artificial Intelligence for IT Operations (AIOps) has emerged as a solution, leveraging advanced analytics like machine learning and big data to enhance incident management. AIOps detects and predicts incidents, identifies root causes, and automates healing actions, improving quality and reducing operational costs. However, despite its potential, the AIOps domain is still in its early stages, decentralized across multiple sectors, and lacking standardized conventions. Research and industrial contributions are distributed without consistent frameworks for data management, target problems, implementation details, requirements, and capabilities. This study proposes an AIOps terminology and taxonomy, establishing a structured incident management procedure and providing guidelines for constructing an AIOps framework. The research also categorizes contributions based on criteria such as incident management tasks, application areas, data sources, and technical approaches. The goal is to provide a comprehensive review of technical and research aspects in AIOps for incident management, aiming to structure knowledge, identify gaps, and establish a foundation for future developments in the field.

Paper Structure (30 sections, 14 figures, 27 tables)

This paper contains 30 sections, 14 figures, 27 tables.

Figures (14)

  • Figure 1: Exploring the research landscape of AIOps subareas with a focus on Incident Management.
  • Figure 2: Comprehensive chronological schema highlighting the distinctions and key connections among Faults, Bugs, Errors, Anomalies, Failures, and Outages.
  • Figure 3: Behavioral scheme of the different maintenance protocols. Adapted and improved from [fink2020data].
  • Figure 4: Comprehensive AIOps reference architecture for Incident Management Procedure [prasad2018marketibmAIOpsArchi].
  • Figure 5: Data lakehouse reference architecture
  • ...and 9 more figures