ModelRadar: Aspect-based Forecast Evaluation
Vitor Cerqueira, Luis Roque, Carlos Soares
TL;DR
Forecast evaluation often relies on a single aggregate metric, which hides how models perform under varying data and problem conditions. ModelRadar is an aspect-based evaluation framework that analyzes performance across data characteristics (stationarity, seasonality, anomalies, sampling frequency) and problem characteristics (forecast horizon, difficulty), using three aggregation modes: overall performance, expected shortfall $ES_{\alpha}$ with $\alpha=10\%$, and win/loss analysis. Applying the framework to 24 forecasting methods (classical, ML-regression, and NHITS) on monthly and quarterly data shows that NHITS is strong overall, but its advantage varies with the forecast horizon and the presence of anomalies. Classical methods such as ETS and Theta are notably robust to anomalies and are favored at short horizons. The study offers practical guidance for model selection, emphasizes the value of multi-dimensional evaluation, and provides an open-source Python package to reproduce and extend the analysis.
Abstract
Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. While convenient, averaging performance over all samples dilutes relevant information about model behavior under varying conditions. This limitation is especially problematic for time series forecasting, where multiple layers of averaging (across time steps, horizons, and the time series in a dataset) can mask relevant performance variations. We address this limitation by proposing ModelRadar, a framework for evaluating univariate time series forecasting models across multiple aspects, such as stationarity, presence of anomalies, or forecasting horizons. We demonstrate the advantages of this framework by comparing 24 forecasting methods, including classical approaches and different machine learning algorithms. NHITS, a state-of-the-art neural network architecture, performs best overall, but its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, we found that NHITS (like other neural networks) outperforms classical approaches only in multi-step-ahead forecasting. Another relevant insight is that classical approaches such as ETS or Theta are notably more robust in the presence of anomalies. These and other findings highlight the importance of aspect-based model evaluation for both practitioners and researchers. ModelRadar is available as a Python package.
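
To make the three aggregation modes named in the TL;DR concrete, the following is a minimal sketch in plain NumPy. It does not use the ModelRadar package or its API. All function names and the toy error values are hypothetical, and the expected shortfall is assumed here to be the mean error over the worst $\alpha$ fraction of series, which may differ in detail from the definition used in the paper.

```python
import numpy as np

def smape(y_true, y_pred):
    """SMAPE in percent for one series; terms with a zero denominator count as 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom > 0)
    return 100.0 * ratio.mean()

def expected_shortfall(errors, alpha=0.10):
    """Assumed definition: mean error over the worst alpha fraction of series."""
    errors = np.sort(np.asarray(errors, dtype=float))
    k = max(1, int(np.ceil(alpha * errors.size)))
    return errors[-k:].mean()

def win_rate(errors_a, errors_b):
    """Fraction of series on which model A has a strictly lower error than B."""
    a = np.asarray(errors_a, dtype=float)
    b = np.asarray(errors_b, dtype=float)
    return float(np.mean(a < b))

def summarize(errors_a, errors_b, mask=None, alpha=0.10):
    """Aggregate per-series errors, optionally restricted to one aspect (boolean mask)."""
    a = np.asarray(errors_a, dtype=float)
    b = np.asarray(errors_b, dtype=float)
    if mask is not None:
        mask = np.asarray(mask, dtype=bool)
        a, b = a[mask], b[mask]
    return {
        "mean_a": a.mean(),
        "mean_b": b.mean(),
        "es_a": expected_shortfall(a, alpha),
        "es_b": expected_shortfall(b, alpha),
        "win_rate_a": win_rate(a, b),
    }

# Hypothetical per-series errors for two models over ten series (toy values).
errors_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 21]   # better on average, one large failure
errors_b = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]    # worse on average, stable
has_anomaly = [False] * 9 + [True]            # hypothetical aspect label per series

overall = summarize(errors_a, errors_b)
with_anomalies = summarize(errors_a, errors_b, mask=has_anomaly)
print(overall)
print(with_anomalies)
```

In this toy setup, model A has the lower mean error and wins on most series, while model B has the lower expected shortfall and wins on the single series flagged as anomalous. A single averaged score would hide this kind of disagreement between views.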
