AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review

Youcef Remil; Anes Bendimerad; Romain Mathonat; Mehdi Kaytoue

TL;DR

This survey addresses the challenge of managing incidents in increasingly complex IT environments by proposing a unified AIOps framework for incident management. It introduces six core AIOps abilities (Perception, Prevention, Detection, Location, Action, Interaction) and a structured, multi-layer incident-management workflow, complemented by a new taxonomy and data-management guidelines. The paper surveys a wide range of data-driven techniques for incident detection, prediction, prioritization, assignment, classification, deduplication, root cause analysis (RCA), correlation, and mitigation, and catalogues publicly available datasets and benchmarks to enable reproducibility. By organizing contributions around data sources, data types, tasks, and evaluation metrics, it identifies gaps (e.g., in assignment and deduplication) and highlights open challenges, including interpretability, trust, scalability, and the need for industry-academia collaboration and open-source data and models for real-world impact.

Abstract

The management of modern IT systems poses unique challenges, necessitating scalability, reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on manual tasks and rule-based approaches, prove inefficient for the substantial data volumes and alerts generated by IT systems. Artificial Intelligence for IT Operations (AIOps) has emerged as a solution, leveraging advanced analytics like machine learning and big data to enhance incident management. AIOps detects and predicts incidents, identifies root causes, and automates healing actions, improving quality and reducing operational costs. However, despite its potential, the AIOps domain is still in its early stages, decentralized across multiple sectors, and lacking standardized conventions. Research and industrial contributions are distributed without consistent frameworks for data management, target problems, implementation details, requirements, and capabilities. This study proposes an AIOps terminology and taxonomy, establishing a structured incident management procedure and providing guidelines for constructing an AIOps framework. The research also categorizes contributions based on criteria such as incident management tasks, application areas, data sources, and technical approaches. The goal is to provide a comprehensive review of technical and research aspects in AIOps for incident management, aiming to structure knowledge, identify gaps, and establish a foundation for future developments in the field.

Paper Structure (30 sections, 14 figures, 27 tables)

Introduction
Context and Motivation
Focus of this Review: AIOps for Incident Management
Outline and Contributions
Streamlining Incident Management Procedure
Terminology and Definitions
Existing Maintenance Protocols in Incident Management
Target Maintenance Strata
Towards an Automated AIOps Solution for Incident Management
Pain points and Challenges
AIOps Framework for Data and Incident Management Procedure
Intelligent Incident Management Procedure Tasks
Desiderata for Effective Intelligent Incident Management
Proposed Taxonomy
Data Sources and Types
...and 15 more sections

Figures (14)

Figure 1: Exploring the research landscape of AIOps subareas with a focus on Incident Management.
Figure 2: Comprehensive chronological schema highlighting the distinctions and key connections among Faults, Bugs, Errors, Anomalies, Failures, and Outages.
Figure 3: Behavioral scheme of the different maintenance protocols. Adapted and improved from [fink2020data].
Figure 4: Comprehensive AIOps reference architecture for Incident Management Procedure [prasad2018marketibmAIOpsArchi].
Figure 5: Data lakehouse reference architecture
...and 9 more figures
