Forecasting with Deep Learning: Beyond Average of Average of Average Performance
Vitor Cerqueira, Luis Roque, Carlos Soares
TL;DR
This work argues that single-number evaluation metrics can obscure condition-dependent performance differences in time-series forecasting. It introduces an aspect-based framework that analyzes univariate forecasts across factors such as sampling frequency, forecasting horizon, and the presence of anomalies, enabling a more nuanced comparison between NHITS, a state-of-the-art deep learning model, and classical methods such as ARIMA and Theta. Empirical results show that NHITS generally achieves superior SMAPE and worst-case performance, but its advantage varies with the horizon and with anomaly conditions: classical methods sometimes outperform it on series containing anomalies or in short-horizon forecasts. The study underscores the value of multi-faceted evaluation for selecting forecasting methods and for guiding future research on model robustness and adaptability, especially for long-horizon forecasting and anomaly handling. SMAPE and related metrics are used to quantify performance across diverse datasets (M3, M4, Tourism) and conditions, illustrating the practical implications of aspect-based evaluation for real-world forecasting tasks.
Abstract
Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. We hypothesize that averaging performance over all samples dilutes relevant information about the relative performance of models, in particular about conditions in which this relative performance differs from the overall accuracy. We address this limitation by proposing a novel framework for evaluating univariate time series forecasting models from multiple perspectives, such as one-step-ahead versus multi-step-ahead forecasting. We show the advantages of this framework by comparing a state-of-the-art deep learning approach with classical forecasting techniques. While classical methods (e.g. ARIMA) are long-standing approaches to forecasting, deep neural networks (e.g. NHITS) have recently shown state-of-the-art forecasting performance on benchmark datasets. We conducted extensive experiments showing that NHITS generally performs best, but its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, NHITS only outperforms classical approaches for multi-step-ahead forecasting. Another relevant insight is that, when dealing with anomalies, NHITS is outperformed by methods such as Theta. These findings highlight the importance of aspect-based model evaluation.
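Since SMAPE is the headline metric throughout, a minimal sketch may help fix ideas. Several SMAPE variants exist in the literature; the one below (mean of 2|y − f| / (|y| + |f|), expressed as a percentage) is an assumption on our part, as the paper does not reproduce its exact definition here. The function name and zero-denominator handling are likewise illustrative choices.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent.

    One common variant: the mean over all points of
    2 * |y - f| / (|y| + |f|), scaled by 100.
    Points where both y and f are zero contribute 0 by convention.
    """
    if len(actual) != len(forecast):
        raise ValueError("series must have equal length")
    total = 0.0
    for y, f in zip(actual, forecast):
        denom = abs(y) + abs(f)
        total += 0.0 if denom == 0 else 2.0 * abs(y - f) / denom
    return 100.0 * total / len(actual)
```

For example, a perfect forecast yields a SMAPE of 0, and the metric is bounded above by 200, which is one reason it is popular for aggregating errors across series with very different scales, as in the M3, M4, and Tourism benchmarks.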
