On the Model Update Strategies for Supervised Learning in AIOps Solutions

Yingzhe Lyu, Heng Li, Zhen Ming (Jack) Jiang, Ahmed E. Hassan

TL;DR

This paper investigates how to update supervised learning models in AIOps systems as operation data evolve. Through a case study on the Google Cluster Trace, Backblaze Disk Stats, and Alibaba GPU Cluster Trace datasets, it compares five update strategies (stationary models, periodic retraining, concept-drift guided retraining, time-based ensembles, and online learning) across multiple models using metrics for performance ($AUC$), updating cost (EC), and stability. Findings show that active update strategies improve performance and stability over stationary models. Concept-drift guided retraining often matches or approaches periodic retraining while reducing retraining frequency, and time-based ensembles offer strong gains in certain scenarios but incur higher testing costs. The results give practitioners practical guidance for balancing model performance, maintenance effort, and latency requirements, and point to directions for more efficient drift detection and ensemble methods in AIOps. A replication package is provided to facilitate reproducibility and further research.

Abstract

AIOps (Artificial Intelligence for IT Operations) solutions leverage machine learning models and the massive data produced during the operation of large-scale systems to assist software engineers in their system operations. As operation data produced in the field constantly evolve due to factors such as the changing operational environment and user base, the models in AIOps solutions need to be constantly maintained after deployment. While prior works focus on innovative modeling techniques to improve the performance of AIOps models before releasing them into the field, when and how to update AIOps models remains an under-investigated topic. In this work, we performed a case study on three large-scale public operation datasets and empirically assessed five different types of model update strategies for supervised learning regarding their performance, updating cost, and stability. We observed that active model update strategies (e.g., periodical retraining, concept drift guided retraining, time-based model ensembles, and online learning) achieve better and more stable performance than a stationary model. In particular, applying sophisticated model update strategies could provide better performance, efficiency, and stability than simply retraining AIOps models periodically. In addition, we observed that, although some update strategies can save model training time, they significantly sacrifice model testing time, which could hinder their application in AIOps solutions where the operation data arrive at a high pace and volume and where immediate inferences are required. Our findings highlight that practitioners should consider the evolution of operation data and actively maintain AIOps models over time. Our observations can also guide researchers and practitioners in investigating more efficient and effective model update strategies that fit the context of AIOps.
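The contrast the abstract draws between a stationary model and periodical retraining can be sketched on toy data. The following is a minimal illustration, not the paper's implementation: the single-feature threshold "model", the helper names (`fit_threshold`, `accuracy`, `evaluate`), the sliding window of one period, and the synthetic drifting data are all assumptions made for this example, and plain accuracy stands in for the AUC metric used in the study.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature value, label in {0, 1})


def fit_threshold(samples: Sequence[Sample]) -> float:
    """Toy model (assumption): midpoint between the two class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    if not pos or not neg:
        return float("inf")  # degenerate case: always predict the negative class
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold: float, samples: Sequence[Sample]) -> float:
    """Fraction of samples where 'x > threshold' agrees with the label."""
    correct = sum((x > threshold) == (y == 1) for x, y in samples)
    return correct / len(samples)


def evaluate(periods: Sequence[Sequence[Sample]], retrain: bool,
             window: int = 1) -> List[float]:
    """Score each period after the first `window` periods.

    retrain=False: stationary model, trained once on the first `window` periods.
    retrain=True:  periodical retraining on the most recent `window` periods
                   before each test period.
    """
    model = fit_threshold([s for p in periods[:window] for s in p])
    scores = []
    for t in range(window, len(periods)):
        if retrain:
            recent = [s for p in periods[t - window:t] for s in p]
            model = fit_threshold(recent)
        scores.append(accuracy(model, periods[t]))
    return scores


# Synthetic drift (assumption): both classes shift upward by 2 each period.
periods = [
    [(shift + 0.0, 0), (shift + 1.0, 0), (shift + 3.0, 1), (shift + 4.0, 1)]
    for shift in (0.0, 2.0, 4.0, 6.0)
]

stationary_scores = evaluate(periods, retrain=False)
retrained_scores = evaluate(periods, retrain=True)
print("stationary:", stationary_scores)
print("retrained: ", retrained_scores)
```

On this toy data the stationary model degrades as the data drift away from the first period, while the retrained model stays level, which mirrors the qualitative observation in the abstract rather than any measured result from the study.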

Paper Structure (37 sections, 2 equations, 10 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Prior research on AIOps solutions
Prior research on dealing with data evolution
Concept drift detection and mitigation
Time-based ensemble models
Online learning models
Case Study Subjects and Preliminary Study
Case Study Subjects
Google Cluster Trace Dataset
Backblaze Disk Stats Dataset
Alibaba GPU Cluster Trace Dataset
Preliminary Study
Approach
Results
...and 22 more sections

Figures (10)

Figure 1: Data schema for our studied datasets. Each colored box represents a data table: a line of the table name followed by lines describing the data fields. For the Google and Alibaba datasets, each table (e.g., machine_events) is one or multiple CSV files containing the fields described in the box. For the Backblaze dataset, the tables represent the logical view, while the physical data is stored as daily snapshots of each disk's attributes.
Figure 2: Number of samples in different time periods of the studied datasets.
Figure 3: Failure rates in different time periods of the studied datasets.
Figure 4: Statistical difference of dependent variables in different time periods of the studied datasets. The symbols in each cell indicate the statistical significance of the failure rate difference: (blank) $p \ge 0.05$; * $p < 0.05$; ** $p < 0.01$; *** $p < 0.001$. The color indicates the effect size of the failure rate difference using $\delta$: Negligible: $\delta < 0.147$; Small: $\delta < 0.33$; Medium: $\delta < 0.474$; Large: $\delta \ge 0.474$.
Figure 5: Illustration of different strategies for maintaining AIOps models. The illustration for the "retraining approach" represents both the periodical retraining and concept drift guided retraining strategies.
...and 5 more figures
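
The significance symbols and effect-size bands in the Figure 4 caption can be written as two small lookup functions. This is a sketch using only the thresholds stated in the caption. The function names are hypothetical, and taking the absolute value of $\delta$ before banding is an assumption, since the caption does not say how negative values are treated.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the symbol used in the Figure 4 caption."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # blank cell: p >= 0.05


def effect_size_label(delta: float) -> str:
    """Map an effect size delta to the band named in the Figure 4 caption.

    Assumption: the magnitude |delta| is what gets banded.
    """
    magnitude = abs(delta)
    if magnitude < 0.147:
        return "negligible"
    if magnitude < 0.33:
        return "small"
    if magnitude < 0.474:
        return "medium"
    return "large"


# Example: one cell of a period-by-period comparison grid.
cell = (significance_stars(0.004), effect_size_label(0.40))
print(cell)
```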
