
TimeSeriesBench: An Industrial-Grade Benchmark for Time Series Anomaly Detection Models

Haotian Si, Jianhui Li, Changhua Pei, Hang Cui, Jingwen Yang, Yongqian Sun, Shenglin Zhang, Jingjing Li, Haiming Zhang, Jing Han, Dan Pei, Gaogang Xie

TL;DR

An industrial-grade benchmark, TimeSeriesBench, is proposed. The performance of existing algorithms is assessed across more than 168 evaluation settings, and a comprehensive analysis is provided to guide the future design of anomaly detection algorithms.

Abstract

Time series anomaly detection (TSAD) has gained significant attention because of its real-world use in improving the stability of modern software systems. However, there is no effective way to verify whether TSAD algorithms can meet the requirements of real-world deployment. First, current algorithms typically train a separate model for each time series. Maintaining so many models is impractical in a large-scale system with tens of thousands of curves, and how well a single unified model detects anomalies remains unknown. Second, most TSAD models are trained on the historical part of a time series and tested on its future segment. In distributed systems, however, deployments and upgrades are frequent, and new, previously unseen time series emerge daily. How current TSAD algorithms perform on such newly arriving, unseen time series remains unknown. Finally, the assumptions behind the evaluation metrics in existing benchmarks are far from practical demands. To address these problems, we propose an industrial-grade benchmark, TimeSeriesBench. We assess the performance of existing algorithms across more than 168 evaluation settings and provide a comprehensive analysis to guide the future design of anomaly detection algorithms. An industrial dataset is also released along with TimeSeriesBench.

Paper Structure (23 sections, 14 figures, 3 tables)

This paper contains 23 sections, 14 figures, 3 tables.

Figures (14)

  • Figure 1: The industrial-grade concerns and TimeSeriesBench's solutions on benchmarking univariate time series anomaly detection algorithms.
  • Figure 2: Illustrations of anomaly types. Anomalous segments are highlighted in pink.
  • Figure 3: Illustrations of three learning schemas. $s_n$ indicates the $n$-th time series in the dataset.
  • Figure 4: Illustrations of evaluation criteria based on point-adjustment (PA). Point-wise PA gives an inflated score when some anomaly segments persist for a long duration. Event-wise PA treats each anomaly segment as an event, completely disregarding the length of the anomaly segment. Reduced-length PA considers the trade-offs between the two methods, holding greater practical significance in real-world applications.
  • Figure 5: Illustrations of k-delay adjustment. This strategy can be combined with point-adjustment as a complementary evaluation paradigm.
  • ...and 9 more figures
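
The captions of Figures 4 and 5 describe point-adjustment (PA) and k-delay adjustment only in words. The following is a minimal sketch of how these two adjustments are commonly implemented, not the paper's own code. The function names `anomaly_segments` and `point_adjust`, the parameter `k`, and the convention that a detection counts only if it falls within the first `k + 1` points of a segment are assumptions made for illustration. Reduced-length PA is not sketched because the caption does not specify how the reduced length is chosen.

```python
def anomaly_segments(labels):
    """Return (start, end) pairs, end exclusive, for each run of anomalous labels."""
    segs = []
    start = None
    for i, y in enumerate(labels):
        if y and start is None:
            start = i                    # a new anomaly segment begins
        elif not y and start is not None:
            segs.append((start, i))      # the segment ended at i - 1
            start = None
    if start is not None:
        segs.append((start, len(labels)))
    return segs


def point_adjust(labels, preds, k=None):
    """Point-wise PA, optionally combined with a k-delay constraint (assumed form).

    With k=None, any detection inside a true anomaly segment marks the whole
    segment as detected. With an integer k, only detections at most k points
    after the segment start count; otherwise the whole segment is marked missed.
    Predictions outside true anomaly segments are left unchanged.
    """
    adjusted = list(preds)
    for start, end in anomaly_segments(labels):
        window_end = end if k is None else min(end, start + k + 1)
        hit = any(preds[start:window_end])
        for i in range(start, end):
            adjusted[i] = 1 if hit else 0
    return adjusted


labels = [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
preds = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
plain_pa = point_adjust(labels, preds)         # first segment fully credited
strict_pa = point_adjust(labels, preds, k=1)   # detection arrives too late
```

In this example the single detection at index 3 credits all four points of the first segment under plain PA, which illustrates the score inflation that the Figure 4 caption attributes to long anomaly segments. With `k=1` the same detection arrives two points after the segment start and the segment is counted as missed.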