mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

Xiaona Zhou, Constantin Brif, Ismini Lourentzou

TL;DR

The paper tackles the challenge of selecting effective anomaly detectors for multivariate time series by introducing mTSBench, the largest benchmark to date, with 344 labeled series from 19 datasets across 12 domains. It evaluates 24 detectors (including LLM-based approaches) and 3 unsupervised model selectors within a unified framework that reports 13 anomaly-detection metrics and 3 ranking metrics, plus a dataset-level performance matrix for selection. The results show that no detector consistently dominates across datasets and that current selectors remain far from optimal selection, with substantial domain- and dataset-specific variability. These findings underscore the need for robust, domain-aware model selection and point toward future directions that integrate meta-learning, continual learning, and foundation-model-based approaches to improve cross-domain generalization and reliability.

Abstract

Anomaly detection in multivariate time series is essential across domains such as healthcare, cybersecurity, and industrial monitoring, yet remains fundamentally challenging due to high-dimensional dependencies, the presence of cross-correlations between time-dependent variables, and the scarcity of labeled anomalies. We introduce mTSBench, the largest benchmark to date for multivariate time series anomaly detection and model selection, consisting of 344 labeled time series across 19 datasets from a wide range of application domains. We comprehensively evaluate 24 anomaly detectors, including the only two publicly available large language model-based methods for multivariate time series. Consistent with prior findings, we observe that no single detector dominates across datasets, motivating the need for effective model selection. We benchmark three recent model selection methods and find that even the strongest of them remain far from optimal. Our results highlight the outstanding need for robust, generalizable selection strategies. We open-source the benchmark at https://plan-lab.github.io/mtsbench to encourage future research.

Paper Structure

This paper contains 18 sections, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Average AUC-ROC ($\uparrow$) Performance of 24 Anomaly Detection Methods ($x$-axis) Evaluated Across 19 mTSBench Datasets ($y$-axis). The substantial performance variability across datasets highlights the need for robust model selection strategies. mTSBench benchmarks the capability of model selection techniques to systematically identify the optimal anomaly detection method among 24 state-of-the-art detectors evaluated on a comprehensive collection of 344 multivariate time series.
  • Figure 2: mTSBench Overview. mTSBench is the largest and most diverse benchmark for multivariate time series anomaly detection and model selection, spanning 19 multivariate time series datasets across various application domains and establishing a platform for robust anomaly detection and adaptive model selection in real-world multivariate contexts. mTSBench's comprehensive evaluation suite and diverse collection of state-of-the-art anomaly detectors, including statistical, deep learning, and LLM-based approaches, facilitate standardized comparison of model selection strategies.
  • Figure 3: Comparison of Anomaly Detection Methods Across Five Evaluation Metrics. The boxplots illustrate the distribution of performance scores for each method evaluated over all mTSBench datasets, measured using VUS-PR, VUS-ROC, AUC-PR, AUC-ROC, and AUC-PTRT. Detectors are ordered by their average VUS-PR score. Boxes represent interquartile ranges, with solid lines indicating the median and dashed lines indicating the mean. In-depth analysis in \ref{sec:adr}.
  • Figure 4: Model Selection Performance Grouped by Time Series Dimensionality. VUS-PR (left) and AUC-ROC (right) for three dimensionality groups ($<$10, 10--30, $>$30). Discussion is in \ref{sec:msp}.
  • Figure 5: Ranking Comparison of Model Selection Methods. (Left) Precision$@$3, Recall$@$3, and NDCG$@$3 for each method. (Right) Recall$@k$ as a function of $k$. Discussion is in \ref{sec:rank}.
  • ...and 8 more figures
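
The ranking metrics named in the Figure 5 caption (Precision$@k$, Recall$@k$, NDCG$@k$) can be illustrated with a minimal sketch. This uses the standard textbook definitions, which may differ in detail from the paper's own implementation. The detector names, the "relevant" set, and the gain values below are hypothetical placeholders, not results from mTSBench.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked detectors that belong to the relevant set."""
    top_k = ranked[:k]
    return sum(1 for d in top_k if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant detectors that appear in the top-k ranking."""
    if not relevant:
        return 0.0
    top_k = ranked[:k]
    return sum(1 for d in top_k if d in relevant) / len(relevant)


def ndcg_at_k(ranked, gains, k):
    """Normalized discounted cumulative gain over the top-k ranking.

    `gains` maps a detector name to a graded relevance value;
    detectors missing from `gains` contribute zero gain.
    """
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: a selector's ranking of four placeholder detectors.
ranked = ["det_A", "det_B", "det_C", "det_D"]
# Assumed ground truth: the three detectors that truly perform best.
relevant = {"det_A", "det_C", "det_E"}
# Assumed graded relevance (higher means a better true detector).
gains = {"det_A": 3.0, "det_C": 2.0, "det_E": 1.0}

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, gains, 3)
print(f"Precision@3={p3:.3f} Recall@3={r3:.3f} NDCG@3={n3:.3f}")
```

In this toy case, two of the three top-ranked detectors are relevant, so Precision$@$3 and Recall$@$3 coincide at 2/3. NDCG$@$3 additionally penalizes placing a zero-gain detector in second position.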