PATE: Proximity-Aware Time series anomaly Evaluation
Ramin Ghorbani, Marcel J. T. Reinders, David M. J. Tax
TL;DR
This paper addresses the challenge of evaluating time series anomaly detectors in a way that accounts for temporal dynamics, such as early and delayed detections, by introducing Proximity-Aware Time series anomaly Evaluation (PATE). PATE places buffer zones around anomalies and uses proximity-based weights to compute a weighted Precision-Recall measure. It then averages the resulting area under the curve over multiple pre-buffer settings $e \in E$ and post-buffer settings $d \in D$ to yield a threshold-free final score: $PATE = \frac{1}{|E|\times|D|}\sum_{e\in E}\sum_{d\in D} \text{AUC-PR}_{e,d}$. In synthetic and real-world experiments, PATE provides more nuanced and reliable model comparisons than traditional metrics such as PA-F1, AUC-ROC, or VUS-ROC, and it reveals ranking shifts among state-of-the-art detectors. The approach offers a robust, adaptable framework for fair evaluation across diverse datasets, extends to binary outputs via PATE-F1, and has publicly available code for reproducibility.
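As a rough illustration of the final averaging step only, the sketch below computes the mean of $\text{AUC-PR}_{e,d}$ over a grid of buffer settings. It assumes that $E$ and $D$ are the sets of pre- and post-buffer sizes, and it takes the per-setting score as a user-supplied callable; the names `pate_score` and `toy_auc_pr` are hypothetical and are not taken from the authors' code. The proximity-based weighting itself is not reproduced here.

```python
from itertools import product
from typing import Callable, Sequence


def pate_score(
    auc_pr: Callable[[int, int], float],
    pre_buffers: Sequence[int],
    post_buffers: Sequence[int],
) -> float:
    """Average AUC-PR_{e,d} over every (e, d) pair in E x D.

    `auc_pr(e, d)` is assumed to return the weighted AUC-PR obtained
    with pre-buffer size e and post-buffer size d.
    """
    if not pre_buffers or not post_buffers:
        raise ValueError("both buffer lists must be non-empty")
    total = sum(auc_pr(e, d) for e, d in product(pre_buffers, post_buffers))
    return total / (len(pre_buffers) * len(post_buffers))


def toy_auc_pr(e: int, d: int) -> float:
    """Made-up stand-in score, used only to exercise the averaging."""
    return 0.5 + 0.01 * e + 0.02 * d


# Average over the 2 x 2 grid E = {0, 10}, D = {0, 5}.
print(pate_score(toy_auc_pr, pre_buffers=[0, 10], post_buffers=[0, 5]))
```

In practice the callable would wrap whatever routine produces the weighted Precision-Recall curve for a given pair of buffer sizes; passing it in keeps the averaging independent of how the weights are defined.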
Abstract
Evaluating anomaly detection algorithms in time series data is critical, as inaccuracies can lead to flawed decision-making in domains where real-time analytics and data-driven strategies are essential. Traditional performance metrics assume i.i.d. data and fail to capture the complex temporal dynamics and specific characteristics of time series anomalies, such as early and delayed detections. We introduce Proximity-Aware Time series anomaly Evaluation (PATE), a novel evaluation metric that incorporates the temporal relationship between prediction and anomaly intervals. PATE applies proximity-based weighting within buffer zones around anomaly intervals, enabling a more detailed and informed assessment of each detection. Using these weights, PATE computes a weighted version of the area under the Precision-Recall curve. Our experiments with synthetic and real-world datasets show that PATE provides more sensible and accurate evaluations than other evaluation metrics. We also tested several state-of-the-art anomaly detectors across various benchmark datasets using the PATE evaluation scheme. The results show that a common metric such as the Point-Adjusted F1 score fails to characterize detection performance well, and that PATE provides a fairer model comparison. By introducing PATE, we redefine the understanding of model efficacy in a way that steers future studies toward developing more effective and accurate detection models.
