PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series

Lucas Correia, Jan-Christoph Goos, Thomas Bäck, Anna V. Kononova

TL;DR

PATH tackles the scarcity of realistic, high-quality benchmarks for online anomaly detection in multivariate time series by introducing a large, discrete-sequence dataset generated from a physics-based automotive powertrain simulator. The dataset captures dynamic, variable-state behavior and supports unsupervised, semi-supervised, and forecasting tasks, with anomalies introduced via pre-simulation model changes to ensure realism. Baseline experiments compare classical and deep learning approaches, revealing substantial gains when using semi-supervised settings and the critical role of threshold choice in online detection. The work emphasizes reproducibility and provides a foundation for future domain expansion and improved online evaluation metrics to better reflect practical anomaly detection performance.

Abstract

Benchmarking anomaly detection approaches for multivariate time series is a challenging task due to a lack of high-quality datasets. Currently available public datasets are too small, lack diversity, and feature trivial anomalies, which hinders measurable progress in this research area. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools that reflects the realistic behaviour of an automotive powertrain, including its multivariate, dynamic, and variable-state properties. Additionally, our dataset represents a discrete-sequence problem, which remains unaddressed by previously proposed solutions in the literature. To cater for both unsupervised and semi-supervised anomaly detection settings, as well as time series generation and forecasting, we make different versions of the dataset available, where training and test subsets are offered in contaminated and clean versions, depending on the task. We also provide baseline results from a selection of approaches based on deterministic and variational autoencoders, as well as a non-parametric approach. As expected, the baseline experiments show that the approaches trained on the semi-supervised version of the dataset outperform their unsupervised counterparts, highlighting a need for approaches that are more robust to contaminated training data. Furthermore, the results show that the chosen threshold can have a large influence on detection performance, so more work needs to be invested in methods for finding a suitable threshold without labelled data.

Paper Structure

This paper contains 13 sections, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Simplified schematic of the FEV model used to generate the PATH dataset. Numbers are signal-flow indices; see Table \ref{tab:signals} for the reference.
  • Figure 2: A more detailed schematic of the vehicle model depicted in Figure \ref{fig:sim_model}. Numbers are signal-flow indices; see Table \ref{tab:signals} for the reference. For simplicity, output signals of the vehicle model that are not fed back into other subsystems are not shown.
  • Figure 3: Sample plot of a nominal sequence after noise addition and trimming. The channel legend can be found in Table \ref{tab:signals}.
  • Figure 4: Plot of an anomalous sequence without regenerative braking (in red) and its control counterpart (in black), both after noise addition and trimming. The anomalous sub-sequence starts after 384.6. The channel legend can be found in Table \ref{tab:signals}.
  • Figure 5: Plot of an anomalous sequence with an added headwind (in red) and its control counterpart (in black), both after noise addition and trimming. The anomalous sub-sequence starts after 738.0. The channel legend can be found in Table \ref{tab:signals}.
  • ...and 5 more figures