PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series

Lucas Correia; Jan-Christoph Goos; Thomas Bäck; Anna V. Kononova

TL;DR

PATH tackles the scarcity of realistic, high-quality benchmarks for online anomaly detection in multivariate time series by introducing a large, discrete-sequence dataset generated from a physics-based automotive powertrain simulator. The dataset captures dynamic, variable-state behavior and supports unsupervised, semi-supervised, and forecasting tasks, with anomalies introduced via pre-simulation model changes to ensure realism. Baseline experiments compare classical and deep learning approaches, revealing substantial gains when using semi-supervised settings and the critical role of threshold choice in online detection. The work emphasizes reproducibility and provides a foundation for future domain expansion and improved online evaluation metrics to better reflect practical anomaly detection performance.

Abstract

Benchmarking anomaly detection approaches for multivariate time series is challenging due to a lack of high-quality datasets. Current publicly available datasets are too small, insufficiently diverse, and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects the realistic behaviour of an automotive powertrain, including its multivariate, dynamic, and variable-state properties. Additionally, our dataset represents a discrete-sequence problem, which remains unaddressed by previously proposed solutions in the literature. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, in which the training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experiments show that approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting the need for approaches that are more robust to contaminated training data. Furthermore, the results show that the choice of threshold can have a large influence on detection performance; hence, more work needs to be invested in methods for finding a suitable threshold without the need for labelled data.
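
To illustrate why the threshold matters, here is a minimal sketch of one common label-free heuristic: setting the threshold to a high percentile of the anomaly scores observed on training data. It is not the thresholding method used in the paper. The function names, the synthetic gamma-distributed scores, and the chosen percentiles are all assumptions made for illustration only.

```python
import numpy as np


def percentile_threshold(train_scores, q):
    """Return the q-th percentile of the training anomaly scores (no labels needed)."""
    return float(np.percentile(train_scores, q))


def detect(scores, threshold):
    """Flag every time step whose anomaly score exceeds the threshold."""
    return np.asarray(scores) > threshold


def f1_score(flags, labels):
    """Point-wise F1 score; labels are used for evaluation only."""
    tp = np.sum(flags & labels)
    fp = np.sum(flags & ~labels)
    fn = np.sum(~flags & labels)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical anomaly scores (e.g. reconstruction errors), not PATH data.
    train_scores = rng.gamma(shape=2.0, scale=1.0, size=5000)
    normal_scores = rng.gamma(shape=2.0, scale=1.0, size=900)
    anomalous_scores = rng.gamma(shape=2.0, scale=1.0, size=100) + 4.0
    test_scores = np.concatenate([normal_scores, anomalous_scores])
    test_labels = np.concatenate([np.zeros(900, dtype=bool), np.ones(100, dtype=bool)])

    # The same scores yield different detection quality for different thresholds.
    for q in (90.0, 99.0, 99.9):
        thr = percentile_threshold(train_scores, q)
        flags = detect(test_scores, thr)
        print(f"q={q:5.1f}  threshold={thr:.3f}  F1={f1_score(flags, test_labels):.3f}")
```

Running the script prints one F1 value per percentile, which makes the sensitivity to the threshold visible on the synthetic scores. The labels enter only the evaluation step, never the threshold selection.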
