MTAD: Tools and Benchmarks for Multivariate Time Series Anomaly Detection
Jinyang Liu, Wenwei Gu, Zhuangbin Chen, Yichen Li, Yuxin Su, Michael R. Lyu
TL;DR
MTAD introduces a unified benchmark protocol for multivariate KPI anomaly detection that addresses fragmented evaluation practice by jointly measuring accuracy, salience, delay, and efficiency. It formalizes the task with explicit inputs, labels, and outputs, and surveys 12 representative detectors (5 traditional machine learning methods and 7 deep learning methods). A novel salience metric complements accuracy by capturing how clearly anomalies stand out in detector scores, and a reproducible Python toolkit standardizes experiments across five real-world datasets. The key findings are that deep learning methods do not always outperform traditional methods in accuracy, although they are better at detecting long-lasting anomalies, while traditional methods often raise faster, earlier alerts. The toolkit supports reproducible comparisons and offers practical deployment guidance, giving a concrete path toward more reliable industrial adoption of KPI anomaly detectors that balances performance with interpretability and operational constraints.
Abstract
Key Performance Indicators (KPIs) are essential time-series metrics for ensuring the reliability and stability of many software systems. They faithfully record runtime states, which helps engineers understand anomalous system behaviors and provides informative clues for pinpointing root causes. The unprecedented scale and complexity of modern software systems, however, have caused the volume of KPIs to explode. Consequently, many traditional methods of KPI anomaly detection have become impractical, which has catalyzed the rapid development of machine learning-based solutions in both academia and industry. However, there is currently a lack of rigorous comparison among these KPI anomaly detection methods, and re-implementing them demands non-trivial effort. Moreover, we observe that different works adopt independent evaluation processes with different metrics. Some of these metrics may not fully reveal the capability of a model, and some create an illusion of progress. To better understand the characteristics of different KPI anomaly detectors and to address the evaluation issue, in this paper we provide a comprehensive review and evaluation of twelve state-of-the-art methods and propose a novel metric called salience. Specifically, the selected methods include five traditional machine learning-based methods and seven deep learning-based methods. These methods are evaluated on five publicly available multivariate KPI datasets. A unified toolkit with easy-to-use interfaces is also released. We report the benchmark results in terms of accuracy, salience, efficiency, and delay, all of which are of practical importance for industrial deployment. We believe our work can serve as a basis for future academic research and industrial application.
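To make two of the reported dimensions concrete, the following is a minimal sketch of how accuracy and delay might be computed from per-timestamp anomaly scores. It is not the MTAD toolkit's API: the function names (`pointwise_f1`, `anomaly_segments`, `detection_delays`), the fixed threshold of 0.5, the point-wise F1 formulation, and the definition of delay as the offset of the first detected point within each labeled segment are all assumptions made for illustration. Salience is omitted because its definition is not given in this section.

```python
import numpy as np

def pointwise_f1(labels, preds):
    """Point-wise precision, recall, and F1 (an assumed accuracy measure)."""
    labels = np.asarray(labels, dtype=bool)
    preds = np.asarray(preds, dtype=bool)
    tp = int(np.sum(labels & preds))
    fp = int(np.sum(~labels & preds))
    fn = int(np.sum(labels & ~preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def anomaly_segments(labels):
    """Return (start, end) index pairs of contiguous anomalies; end is exclusive."""
    labels = np.asarray(labels, dtype=int)
    padded = np.concatenate(([0], labels, [0]))
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)
    ends = np.flatnonzero(diff == -1)
    return list(zip(starts.tolist(), ends.tolist()))

def detection_delays(labels, preds):
    """Timestamps from each segment's start to its first alert (None if missed)."""
    preds = np.asarray(preds, dtype=bool)
    delays = []
    for start, end in anomaly_segments(labels):
        hits = np.flatnonzero(preds[start:end])
        delays.append(int(hits[0]) if hits.size else None)
    return delays

# Toy example: hypothetical scores that a detector might emit for a
# multivariate KPI series of 10 timestamps, with ground-truth labels.
scores = np.array([0.1, 0.2, 0.1, 0.3, 0.9, 0.8, 0.2, 0.1, 0.7, 0.1])
labels = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])
preds = scores > 0.5  # assumed fixed threshold, for illustration only

precision, recall, f1 = pointwise_f1(labels, preds)
segments = anomaly_segments(labels)
delays = detection_delays(labels, preds)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("segments:", segments, "delays:", delays)
```

In this toy case the first anomaly is flagged one timestamp late and the second immediately, which shows why a delay measure can separate detectors that have similar accuracy.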
