A Survey of AIOps for Failure Management in the Era of Large Language Models

Lingzhe Zhang; Tong Jia; Mengxi Jia; Yifan Wu; Aiwei Liu; Yong Yang; Zhonghai Wu; Xuming Hu; Philip S. Yu; Ying Li

TL;DR

This paper presents a comprehensive survey of AIOps technology for failure management in the LLM era, which includes a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps.

Abstract

As software systems grow increasingly intricate, Artificial Intelligence for IT Operations (AIOps) methods have been widely used in software system failure management to ensure the high availability and reliability of large-scale distributed software systems. However, these methods still face several challenges, such as a lack of cross-platform generality and cross-task flexibility. Recent advancements in large language models (LLMs) can significantly address these challenges, and many approaches have already been proposed to explore this field. Yet no comprehensive survey currently discusses the differences between LLM-based AIOps and traditional AIOps methods. This paper therefore presents a comprehensive survey of AIOps technology for failure management in the LLM era. It includes a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps. The survey also explores the AIOps subtasks, the specific LLM-based approaches suited to each subtask, and the challenges and future directions of the domain, with the aim of furthering its development and application.

Paper Structure (30 sections, 7 equations, 7 figures, 1 table)

Introduction
Why are LLMs Beneficial for AIOps on Failure Management?
Why a Survey of AIOps for Failure Management in the Era of LLMs?
Preliminary
Data Source for AIOps
LLM-based Approaches for AIOps
AIOps Tasks for Failure Management
Data Preprocessing
Log Parsing
Metrics Imputation
Input Summarization
Failure Perception
Failure Prediction
Anomaly Detection
Root Cause Analysis
...and 15 more sections

Figures (7)

Figure 1: Data Source of AIOps for Failure Management
Figure 2: LLM-based Approaches for AIOps
Figure 3: AIOps Tasks for Failure Management (taxonomy of this survey)
Figure 4: Log-based Failure Perception and Root Cause Analysis: The Common Workflow
Figure 5: Workflow of Various Anomaly Detection Approaches: Prediction-based, Reconstruction-based, and Classification-based
...and 2 more figures
