A Survey of AIOps for Failure Management in the Era of Large Language Models

Lingzhe Zhang, Tong Jia, Mengxi Jia, Yifan Wu, Aiwei Liu, Yong Yang, Zhonghai Wu, Xuming Hu, Philip S. Yu, Ying Li

TL;DR

This paper presents a comprehensive survey of AIOps technology for failure management in the LLM era, which includes a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps.

Abstract

As software systems grow increasingly intricate, Artificial Intelligence for IT Operations (AIOps) methods have been widely used in software system failure management to ensure the high availability and reliability of large-scale distributed software systems. However, these methods still face several challenges, such as a lack of cross-platform generality and cross-task flexibility. Recent advancements in large language models (LLMs) can substantially address these challenges, and many approaches have already been proposed to explore this field. Yet there is currently no comprehensive survey that discusses the differences between LLM-based AIOps and traditional AIOps methods. Therefore, this paper presents a comprehensive survey of AIOps technology for failure management in the LLM era. It includes a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps. Additionally, this survey explores the AIOps subtasks, the specific LLM-based approaches suitable for different AIOps subtasks, and the challenges and future directions of the domain, aiming to further its development and application.

Paper Structure (30 sections, 7 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Data Source of AIOps for Failure Management
  • Figure 2: LLM-based Approaches for AIOps
  • Figure 3: AIOps Tasks for Failure Management (taxonomy of this survey)
  • Figure 4: Log-based Failure Perception and Root Cause Analysis: The Common Workflow
  • Figure 5: Workflow of Various Anomaly Detection Approaches: Prediction-based, Reconstruction-based, and Classification-based
  • ...and 2 more figures