mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale
Xiaona Zhou, Constantin Brif, Ismini Lourentzou
TL;DR
The paper tackles the challenge of selecting effective anomaly detectors for multivariate time series by introducing mTSBench, the largest benchmark to date, with 344 labeled series from 19 datasets across 12 domains. It evaluates 24 detectors (including LLM-based approaches) and 3 unsupervised model selectors within a unified framework that reports 13 anomaly-detection metrics and 3 ranking metrics, plus a dataset-level performance matrix for selection. The results show that no detector consistently dominates across datasets and that current selectors fall well short of optimal selection, with substantial domain- and dataset-specific variability. These findings underscore the need for robust, domain-aware model selection and point toward future directions that integrate meta-learning, continual learning, and foundation-model-based approaches to improve cross-domain generalization and reliability.
Abstract
Anomaly detection in multivariate time series is essential across domains such as healthcare, cybersecurity, and industrial monitoring, yet remains fundamentally challenging due to high-dimensional dependencies, cross-correlations between time-dependent variables, and the scarcity of labeled anomalies. We introduce mTSBench, the largest benchmark to date for multivariate time series anomaly detection and model selection, consisting of 344 labeled time series across 19 datasets from a wide range of application domains. We comprehensively evaluate 24 anomaly detectors, including the only two publicly available large language model-based methods for multivariate time series. Consistent with prior findings, we observe that no single detector dominates across datasets, motivating the need for effective model selection. We benchmark three recent model selection methods and find that even the strongest of them remains far from optimal. Our results highlight the outstanding need for robust, generalizable selection strategies. We open-source the benchmark at https://plan-lab.github.io/mtsbench to encourage future research.
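To make "far from optimal" concrete, the sketch below shows one way a selector could be compared against an oracle using a performance matrix of per-series detector scores. It is a minimal illustration, not the benchmark's own code: the detector names, the scores, and the selector's choices are all made-up placeholders, and the metric is left unspecified.

```python
import numpy as np

# Hypothetical performance matrix: rows are time series, columns are detectors.
# All scores are illustrative placeholders, NOT results from mTSBench.
detectors = ["detector_a", "detector_b", "detector_c"]
perf = np.array([
    [0.90, 0.40, 0.60],
    [0.30, 0.80, 0.50],
    [0.50, 0.55, 0.70],
])

# Oracle: picks the best detector separately for each series (upper bound).
oracle_scores = perf.max(axis=1)

# Best single detector: the column with the highest mean score over all series.
best_single = int(perf.mean(axis=0).argmax())

# Choices of a hypothetical model selector, as one column index per series.
selected = np.array([0, 2, 2])
selector_scores = perf[np.arange(len(perf)), selected]

# Gap between oracle selection and the selector, averaged over series.
gap_to_oracle = oracle_scores.mean() - selector_scores.mean()

print("best single detector:", detectors[best_single])
print("mean oracle score:   ", round(float(oracle_scores.mean()), 3))
print("mean selector score: ", round(float(selector_scores.mean()), 3))
print("gap to oracle:       ", round(float(gap_to_oracle), 3))
```

In this toy setup no column wins on every row, so even the best single detector trails the oracle, which is the situation that motivates per-series model selection.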
