PATE: Proximity-Aware Time series anomaly Evaluation

Ramin Ghorbani, Marcel J. T. Reinders, David M. J. Tax

TL;DR

This paper addresses the challenge of evaluating time series anomaly detectors in a way that accounts for temporal dynamics by introducing Proximity-Aware Time series anomaly Evaluation (PATE). PATE uses buffer zones around anomalies and proximity-based weights to compute a weighted Precision-Recall measure, then averages over multiple pre- and post-buffer settings to yield a threshold-free final score: $PATE = \frac{1}{|E|\times|D|}\sum_{e\in E}\sum_{d\in D} \text{AUC-PR}_{e,d}$, where $E$ and $D$ denote the sets of pre- and post-buffer sizes. In synthetic and real-world experiments, PATE provides more nuanced and reliable model comparisons than traditional metrics such as PA-F1, AUC-ROC, or VUS-ROC, and reveals ranking shifts among SOTA detectors. The approach offers a robust, adaptable framework for fair evaluation across diverse datasets and extends to binary outputs via PATE-F1. The code is publicly available for reproducibility.
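The aggregation step in the formula above can be illustrated with a minimal sketch. It assumes a caller-supplied function, here called `auc_pr_fn`, that returns the weighted AUC-PR for one buffer pair; that function, the name `aggregate_over_buffers`, and the toy scorer are hypothetical and are not taken from the authors' released code.

```python
from itertools import product
from typing import Callable, Sequence


def aggregate_over_buffers(
    auc_pr_fn: Callable[[int, int], float],
    pre_buffers: Sequence[int],   # plays the role of E (assumed naming)
    post_buffers: Sequence[int],  # plays the role of D (assumed naming)
) -> float:
    """Return the unweighted mean of auc_pr_fn(e, d) over all (e, d) pairs."""
    if not pre_buffers or not post_buffers:
        raise ValueError("both buffer lists must be non-empty")
    total = sum(auc_pr_fn(e, d) for e, d in product(pre_buffers, post_buffers))
    return total / (len(pre_buffers) * len(post_buffers))


# Toy stand-in for the per-setting weighted AUC-PR (purely illustrative,
# not a real metric): it just shrinks as the buffers grow.
def toy_auc_pr(e: int, d: int) -> float:
    return 1.0 / (1.0 + e + d)


score = aggregate_over_buffers(toy_auc_pr, [0, 1], [0, 2])
print(round(score, 4))
```

Because the final score is a plain mean over the buffer grid, no single buffer size has to be chosen in advance, which is what makes the result independent of that choice.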

Abstract

Evaluating anomaly detection algorithms in time series data is critical, as inaccuracies can lead to flawed decision-making in various domains where real-time analytics and data-driven strategies are essential. Traditional performance metrics assume i.i.d. data and fail to capture the complex temporal dynamics and specific characteristics of time series anomalies, such as early and delayed detections. We introduce Proximity-Aware Time series anomaly Evaluation (PATE), a novel evaluation metric that incorporates the temporal relationship between prediction and anomaly intervals. PATE uses proximity-based weighting that considers buffer zones around anomaly intervals, enabling a more detailed and informed assessment of a detection. Using these weights, PATE computes a weighted version of the area under the Precision-Recall curve. Our experiments with synthetic and real-world datasets show that PATE provides more sensible and accurate evaluations than other evaluation metrics. We also tested several state-of-the-art anomaly detectors across various benchmark datasets using the PATE evaluation scheme. The results show that a common metric like the Point-Adjusted F1 Score fails to characterize detection performance well, and that PATE provides a fairer model comparison. By introducing PATE, we redefine the understanding of model efficacy in a way that steers future studies toward developing more effective and accurate detection models.
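To make the idea of proximity-based weighting concrete, the sketch below computes a weighted precision and recall for binary predictions around a single anomaly interval. The linear decay inside the buffers, the function names, and the way the weights enter the counts are all assumptions chosen for illustration; the paper defines its own weighting functions, which this does not reproduce.

```python
from typing import Iterable, Tuple


def proximity_weight(t: int, start: int, end: int,
                     pre_buffer: int, post_buffer: int) -> float:
    """Hypothetical weight in [0, 1]: 1 inside [start, end], decaying
    linearly to 0 across the pre- and post-buffer zones, 0 elsewhere."""
    if start <= t <= end:
        return 1.0
    if pre_buffer > 0 and start - pre_buffer <= t < start:
        return (t - (start - pre_buffer)) / pre_buffer
    if post_buffer > 0 and end < t <= end + post_buffer:
        return ((end + post_buffer) - t) / post_buffer
    return 0.0


def weighted_precision_recall(predicted: Iterable[int], start: int, end: int,
                              pre_buffer: int, post_buffer: int
                              ) -> Tuple[float, float]:
    """Weighted precision and recall for one anomaly interval (assumed scheme):
    each predicted point contributes w to TP and (1 - w) to FP."""
    predicted = set(predicted)
    weights = [proximity_weight(t, start, end, pre_buffer, post_buffer)
               for t in predicted]
    tp = sum(weights)
    fp = sum(1.0 - w for w in weights)
    # Anomalous points that were never flagged count fully as misses.
    fn = sum(1 for t in range(start, end + 1) if t not in predicted)
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# One early hit in the pre-buffer (t=9), two hits inside, one far-off alarm.
precision, recall = weighted_precision_recall([9, 10, 11, 20], 10, 14, 2, 2)
print(precision, recall)
```

In this toy scheme the early detection at `t=9` earns partial credit instead of counting as a plain false positive, while the distant alarm at `t=20` is fully penalized. Sweeping a threshold over continuous anomaly scores and integrating the resulting precision-recall pairs would give a weighted AUC-PR for one buffer setting.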

Paper Structure (18 sections, 13 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Illustration of anomaly detection in time series data. $a_{1-3}$ represent the actual anomalies as ground truth. Predictions are denoted by $p$. The durations of both events are indicated by the length of the boxes. Overlapping areas between $p$ and $a$ demonstrate where the model has correctly identified anomalies.
  • Figure 2: Illustration of the Categorization and Weighting Mechanism in the PATE Method. Prediction events ($p_{1}-p_{7}$) are represented by orange boxes, while anomaly events ($a_{1}-a_{4}$) are depicted by blue boxes. TP weights are illustrated with a blue line, FP weights with a red line, and FN weights with a purple line. Note that the solid segments of the lines, in contrast to the dotted segments, indicate the activated weights for the example scenario depicted in the figure.
  • Figure 3: Illustration of examples with synthetic data. The figure shows the placement of different anomaly scores $S$ from a binary anomaly detector.
  • Figure 4: Real-World Datasets and Anomaly Scores of Different Models. The expert-labeled anomalous segment and its corresponding region, against which the models' predictions are compared, are highlighted in red.
  • Figure 5: Comparison of SOTA anomaly detection models using different evaluation metrics across various benchmark datasets.
  • ...and 6 more figures