A Comprehensive Forecasting-Based Framework for Time Series Anomaly Detection: Benchmarking on the Numenta Anomaly Benchmark (NAB)
Mohammad Karami, Mostafa Jalali, Fatemeh Ghassemi
TL;DR
This work addresses the lack of systematic cross-domain evaluation in time series anomaly detection by proposing a unified forecasting-based framework that integrates classical methods (Holt-Winters, SARIMA) and deep-learning forecasters (LSTM, Informer) behind a common residual-based detection interface. It reports the first complete evaluation on the Numenta Anomaly Benchmark (NAB), covering 58 datasets across seven categories with 232 model training runs and 464 detection evaluations, with a 100% success rate. Key findings: LSTM delivers the strongest overall performance on real-world data, Informer offers competitive accuracy with shorter training time, and classical methods excel only on simple synthetic patterns. Importantly, forecasting quality drives detection performance more than the choice of detection method. The results give practitioners evidence-based guidance (prefer LSTM for complex patterns, use Informer when efficiency matters, fall back to classical methods for well-behaved seasonal data) and establish baselines that support future forecasting-based anomaly detection research and reproducibility.
Abstract
Time series anomaly detection is critical for modern digital infrastructures, yet existing methods lack systematic cross-domain evaluation. We present a comprehensive forecasting-based framework unifying classical methods (Holt-Winters, SARIMA) with deep learning architectures (LSTM, Informer) under a common residual-based detection interface. Our modular pipeline integrates preprocessing (normalization, STL decomposition), four forecasting models, four detection methods, and dual evaluation through forecasting metrics (MAE, RMSE, PCC) and detection metrics (Precision, Recall, F1, AUC). We conduct the first complete evaluation on the Numenta Anomaly Benchmark (58 datasets, 7 categories) with 232 model training runs and 464 detection evaluations, achieving a 100% success rate. LSTM achieves the best performance (F1: 0.688, ranking first or second on 81% of datasets) with exceptional correlation on complex patterns (PCC: 0.999). Informer provides competitive accuracy (F1: 0.683) with 30% faster training. Classical methods achieve perfect predictions on simple synthetic data at 60× lower cost but show 2-3× worse F1-scores on real-world datasets. Forecasting quality dominates detection performance: differences between detection methods (F1: 0.621-0.688) are smaller than those between forecasting models (F1: 0.344-0.688). Our findings provide evidence-based guidance: use LSTM for complex patterns, Informer for efficiency-critical deployments, and classical methods for simple periodic data with resource constraints. The complete implementation and results establish baselines for future forecasting-based anomaly detection research.
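To make the idea of a common residual-based detection interface with dual evaluation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that any forecaster can be reduced to an array of predictions aligned with the observations, so that detection only ever sees residuals. The function names, the MAD-based threshold, the default k = 3, and the toy data are all illustrative assumptions.

```python
import numpy as np


def forecasting_metrics(actual, forecast):
    """Return MAE, RMSE, and Pearson correlation (PCC) between two series."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    errors = actual - forecast
    mae = float(np.mean(np.abs(errors)))
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    pcc = float(np.corrcoef(actual, forecast)[0, 1])
    return mae, rmse, pcc


def detect_from_residuals(actual, forecast, k=3.0):
    """Flag points whose residual lies more than k robust sigmas from the median.

    Assumption: a median/MAD threshold stands in for whichever detection
    methods the framework actually uses.
    """
    residuals = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    centered = np.abs(residuals - np.median(residuals))
    mad = np.median(centered)
    scale = 1.4826 * mad  # rescales MAD to the std of Gaussian noise
    if scale == 0.0:
        # Degenerate case (at least half the residuals identical):
        # flag any deviation from the median at all.
        return centered > 0.0
    return centered > k * scale


def detection_metrics(flags, labels):
    """Return point-wise precision, recall, and F1 for boolean arrays."""
    flags = np.asarray(flags, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    tp = int(np.sum(flags & labels))
    fp = int(np.sum(flags & ~labels))
    fn = int(np.sum(~flags & labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical): a clean sine plays the role of the forecast, and the
# observations add a small deterministic wobble plus one injected spike.
t = np.arange(200)
forecast = np.sin(2 * np.pi * t / 20)
actual = forecast + 0.05 * np.cos(0.7 * t)
actual[100] += 5.0
labels = np.zeros(200, dtype=bool)
labels[100] = True

flags = detect_from_residuals(actual, forecast)
print("MAE, RMSE, PCC:", forecasting_metrics(actual, forecast))
print("Precision, Recall, F1:", detection_metrics(flags, labels))
```

Because the detector takes only observations and predictions, a Holt-Winters, SARIMA, LSTM, or Informer forecast could be passed in without changing the detection or evaluation code. This separation is what allows forecasting quality and detection method to be compared independently.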
