McUDI: Model-Centric Unsupervised Degradation Indicator for Failure Prediction AIOps Solutions

Lorena Poenaru-Olaru, Luis Cruz, Jan Rellermeyer, Arie van Deursen

TL;DR

Concept drift causes AIOps models to age, and the periodic retraining used to counter it is costly because it requires labeled data. McUDI is a model-centric, unsupervised drift detector: it uses mean-decrease-in-impurity feature importance to focus on the features that most influence the model, applies a Kolmogorov-Smirnov (KS) test to their distributions, and signals that retraining is needed at a 0.05 significance threshold. In experiments on Google Cluster Traces and Backblaze Disk Stats, McUDI achieves predictive performance comparable to periodic retraining while substantially reducing the number of samples that need labels (by approximately 260k for disk failure prediction and 30k for job failure prediction) and the retraining frequency, and it outperforms KS at reducing false alarms. The work delivers a practical, publicly reproducible maintenance pipeline for AIOps that enables label-efficient, drift-aware retraining, with promising applicability to other tree-based models and domains.
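
The detection step summarized above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The names `detect_drift`, `top_k`, `reference`, and `current` are hypothetical; the importances are assumed to have been extracted beforehand from the trained tree-based model (for example, scikit-learn exposes impurity-based importances as `feature_importances_`); the rule "flag drift if any selected feature changes" is an assumption, since the summary does not state the exact decision rule; and the p-value uses the asymptotic Kolmogorov approximation, where in practice a library routine such as `scipy.stats.ks_2samp` would likely be used.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value (an approximation for large samples)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the true value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def detect_drift(reference, current, importances, top_k=3, alpha=0.05):
    """Test only the top_k most important features for a distribution change.

    reference, current: dict mapping feature name -> list of values
    importances: dict mapping feature name -> importance score (assumed given)
    Returns (drift_flag, changed_feature_names).
    """
    top = sorted(importances, key=importances.get, reverse=True)[:top_k]
    changed = []
    for name in top:
        d = ks_statistic(reference[name], current[name])
        p = ks_pvalue(d, len(reference[name]), len(current[name]))
        if p < alpha:
            changed.append(name)
    # Assumption: a single changed important feature is enough to flag drift.
    return len(changed) > 0, changed


if __name__ == "__main__":
    base = [i / 100 for i in range(200)]
    reference = {"cpu": base, "mem": base}
    current = {"cpu": [v + 5 for v in base], "mem": list(base)}
    importances = {"cpu": 0.9, "mem": 0.1}
    print(detect_drift(reference, current, importances, top_k=1))
```

Restricting the test to the features the model relies on is what makes the indicator model-centric: a shift in a feature the model barely uses does not trigger retraining, which is consistent with the reported reduction in false alarms relative to applying KS without that focus.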

Abstract

Due to the continuous change in operational data, AIOps solutions suffer from performance degradation over time. Although periodic retraining is the state-of-the-art technique for preserving the performance of failure prediction AIOps models over time, it requires a considerable amount of labeled data. In AIOps, obtaining labeled data is expensive since it requires the availability of domain experts to intensively annotate it. In this paper, we present McUDI, a model-centric unsupervised degradation indicator that is capable of detecting the exact moment the AIOps model requires retraining as a result of changes in data. We further show how employing McUDI in the maintenance pipeline of AIOps solutions can reduce the number of samples that require annotation by 30k for job failure prediction and by 260k for disk failure prediction while achieving performance similar to periodic retraining.

Paper Structure (29 sections, 4 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Obtaining the ground truth. Pipeline to assess the presence of concept drift between two batches (Data P1 and Data P2).
  • Figure 2: Difference between retraining periodically and retraining based on drift detection.
  • Figure 3: Label cost-efficient maintenance pipeline including McUDI for failure prediction models.
  • Figure 4: Ground Truth. Batches that contain drift and non-drift for disk and job datasets.
  • Figure 5: Percentage of features that change in each period and their corresponding batch label (drift/non-drift).