MTAD: Tools and Benchmarks for Multivariate Time Series Anomaly Detection

Jinyang Liu, Wenwei Gu, Zhuangbin Chen, Yichen Li, Yuxin Su, Michael R. Lyu

TL;DR

MTAD introduces a unified benchmark protocol for multivariate KPI anomaly detection, addressing fragmentation in evaluation by jointly measuring accuracy, salience, delay, and efficiency. It formalizes multivariate KPI anomaly detection with explicit input, labels, and outputs, and surveys 12 representative detectors (5 traditional ML, 7 DL). A novel salience metric complements accuracy to capture how clearly anomalies stand out in detector scores, and a reproducible Python toolkit standardizes experiments across five real-world datasets. Key findings include that deep learning methods do not always outperform traditional methods in accuracy, though they better detect long-lasting anomalies, while traditional methods often provide faster, earlier alerts; the toolkit supports reproducible comparisons and practical deployment guidance. The work offers a concrete path toward more reliable industrial adoption of KPI anomaly detectors by balancing performance with interpretability and operational constraints.

Abstract

Key Performance Indicators (KPIs) are essential time-series metrics for ensuring the reliability and stability of many software systems. They faithfully record runtime states to facilitate the understanding of anomalous system behaviors and provide informative clues for engineers to pinpoint the root causes. The unprecedented scale and complexity of modern software systems, however, make the volume of KPIs explode. Consequently, many traditional methods of KPI anomaly detection become impractical, which serves as a catalyst for the fast development of machine learning-based solutions in both academia and industry. However, there is currently a lack of rigorous comparison among these KPI anomaly detection methods, and re-implementation demands a non-trivial effort. Moreover, we observe that different works adopt independent evaluation processes with different metrics. Some of them may not fully reveal the capability of a model and some are creating an illusion of progress. To better understand the characteristics of different KPI anomaly detectors and address the evaluation issue, in this paper, we provide a comprehensive review and evaluation of twelve state-of-the-art methods, and propose a novel metric called salience. Particularly, the selected methods include five traditional machine learning-based methods and seven deep learning-based methods. These methods are evaluated with five multivariate KPI datasets that are publicly available. A unified toolkit with easy-to-use interfaces is also released. We report the benchmark results in terms of accuracy, salience, efficiency, and delay, which are of practical importance for industrial deployment. We believe our work can contribute as a basis for future academic research and industrial application.

Paper Structure (30 sections, 3 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: An Example of Point Adjustment
  • Figure 2: Prediction Results of LSTM on SMD Dataset (Machine-3-2)
  • Figure 3: Number of True/False Positive Predictions vs. Different Thresholds: Because the true positive predictions in Anomaly A dominate the false positive ones, the adjusted F1 scores do not decrease significantly as the threshold changes.
  • Figure 4: Prediction Results of OmniAnomaly and LSTM on SMD Dataset (Machine-3-9): Both methods successfully predict higher anomaly scores within the anomalous area. However, OmniAnomaly is more practical because its anomaly scores stand out more clearly from the normal ones.
  • Figure 5: An Example of Salience Computation
  • ...and 2 more figures
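
The captions of Figures 1 and 3 refer to point adjustment and adjusted F1 scores. As a rough illustration only, the sketch below follows the commonly used form of point adjustment: if any point inside a ground-truth anomaly segment is flagged, the whole segment is counted as detected. The function names `point_adjust` and `f1_score` and the toy arrays are assumptions made for this example; they are not taken from the MTAD toolkit, and the paper's exact procedure may differ.

```python
import numpy as np


def point_adjust(pred, label):
    """Return a copy of `pred` where every ground-truth anomaly segment
    containing at least one flagged point is marked as fully detected.
    (Hypothetical helper, not the MTAD toolkit API.)"""
    pred = np.asarray(pred, dtype=bool).copy()
    label = np.asarray(label, dtype=bool)
    n = len(label)
    i = 0
    while i < n:
        if label[i]:
            # Find the end (exclusive) of the contiguous anomaly segment.
            j = i
            while j < n and label[j]:
                j += 1
            if pred[i:j].any():
                pred[i:j] = True
            i = j
        else:
            i += 1
    return pred


def f1_score(pred, label):
    """Point-wise F1 computed from boolean predictions and labels."""
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    tp = int(np.sum(pred & label))
    fp = int(np.sum(pred & ~label))
    fn = int(np.sum(~pred & label))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Toy data (made up for illustration): two anomaly segments, one false alarm.
label = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
pred = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

adjusted = point_adjust(pred, label)
print("raw F1:     ", f1_score(pred, label))
print("adjusted F1:", f1_score(adjusted, label))
```

In this toy case a single hit inside the first segment credits all four of its points, so the adjusted F1 is much higher than the raw point-wise F1. This is consistent with the Figure 3 caption, where true positives from one long anomaly dominate the false positives and keep the adjusted score high.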