Is Your Anomaly Detector Ready for Change? Adapting AIOps Solutions to the Real World
Lorena Poenaru-Olaru, Natalia Karpova, Luis Cruz, Jan Rellermeyer, Arie van Deursen
TL;DR
The paper tackles the challenge of keeping anomaly detectors in AIOps up to date as operational data drifts over time. It systematically compares three retraining paradigms (static, full-history, sliding window) and two retraining frequencies (blind versus informed, where informed retraining is triggered by a drift detector). The comparison covers two real-world datasets (Yahoo S5 and NAB), five unsupervised detectors (FFT, SR, PCI, LSTM-AE, SR-CNN), and the FEDD drift detector. Three findings stand out. First, the advanced models (LSTM-AE and SR-CNN) outperform the simpler methods. Second, sliding-window retraining benefits time-domain detectors such as LSTM-AE, while full-history retraining can help domain-transforming detectors such as SR-CNN. Third, drift-detection-based retraining can improve performance over static baselines, although periodic retraining often yields the best results. The work demonstrates that drift-aware maintenance pipelines are feasible and beneficial for real-world AIOps deployments, while highlighting the need for more open datasets and better drift detectors to generalize beyond the studied domains.
Abstract
Anomaly detection techniques are essential in automating the monitoring of IT systems and operations. These techniques imply that machine learning algorithms are trained on operational data corresponding to a specific period of time and that they are continuously evaluated on newly emerging data. Operational data is constantly changing over time, which affects the performance of deployed anomaly detection models. Therefore, continuous model maintenance is required to preserve the performance of anomaly detectors over time. In this work, we analyze two different anomaly detection model maintenance techniques in terms of the model update frequency, namely blind model retraining and informed model retraining. We further investigate the effects of updating the model by retraining it on all the available data (full-history approach) and only the newest data (sliding window approach). Moreover, we investigate whether a data change monitoring tool is capable of determining when the anomaly detection model needs to be updated through retraining.
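The two maintenance choices described in the abstract (which data to retrain on, and when to retrain) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `select_training_data` and `should_retrain`, the `window` and `period` parameters, and the boolean `drift_detected` flag (standing in for the output of a data change monitoring tool) are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_training_data(history: Sequence[float], strategy: str, window: int) -> List[float]:
    """Return the samples a retrained model would see (hypothetical helper)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if strategy == "full_history":
        # Full-history approach: retrain on all data observed so far.
        return list(history)
    if strategy == "sliding_window":
        # Sliding window approach: retrain on only the newest `window` samples.
        return list(history[-window:])
    raise ValueError(f"unknown strategy: {strategy}")


def should_retrain(step: int, mode: str, period: int = 1, drift_detected: bool = False) -> bool:
    """Decide whether to retrain at this evaluation step (hypothetical helper)."""
    if mode == "static":
        # Static baseline: the model is never updated after initial training.
        return False
    if mode == "blind":
        # Blind retraining: update on a fixed schedule, regardless of the data.
        return step % period == 0
    if mode == "informed":
        # Informed retraining: update only when the monitor reports a change.
        return drift_detected
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, a maintenance loop would call `should_retrain` at each evaluation step and, when it returns true, refit the detector on the output of `select_training_data`.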
