RHiOTS: A Framework for Evaluating Hierarchical Time Series Forecasting Algorithms

Luis Roque, Carlos Soares, Luís Torgo

TL;DR

RHiOTS evaluates the robustness of hierarchical time series (HTS) forecasting methods by generating semi-synthetic datasets through controlled transformations of the leaf series and measuring performance with the mean absolute scaled error (MASE), while leveraging the coherence constraints of the hierarchy. The framework combines systematic perturbations (jittering, scaling, magnitude warping, time warping) with visual analytics to show how models behave across hierarchy levels. On the Tourism, M5, and Police Houston datasets, classical methods (e.g., ETS) are generally more robust than deep learning models, which gain an advantage only under highly disruptive transformations; MinT reconciliation provides no consistent robustness advantage. The framework offers a practical, reproducible tool for selecting HTS methods under data shifts and highlights the value of robustness-focused evaluation beyond standard accuracy benchmarks.
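As a rough illustration of the kind of leaf-series perturbation described above, the following sketch applies jittering and scaling at increasing intensity. It uses the common time series augmentation definitions of these two transformations, not necessarily the ones implemented in RHiOTS; the function names, the `sigma` parameter, and the intensity grid are assumptions made for this example.

```python
import numpy as np

def jitter(series, sigma, rng):
    """Add independent zero-mean Gaussian noise (std `sigma`) to every point."""
    return series + rng.normal(0.0, sigma, size=series.shape)

def scale(series, sigma, rng):
    """Multiply the whole series by a single random factor drawn around 1."""
    factor = rng.normal(1.0, sigma)
    return series * factor

rng = np.random.default_rng(0)
leaf = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0])  # toy leaf series

# Hypothetical intensity grid: a larger sigma means a more disruptive change.
for sigma in (0.1, 0.5, 1.0):
    jittered = jitter(leaf, sigma, rng)
    scaled = scale(leaf, sigma, rng)
    print(
        f"sigma={sigma}: "
        f"mean |jitter change|={np.mean(np.abs(jittered - leaf)):.3f}, "
        f"mean |scale change|={np.mean(np.abs(scaled - leaf)):.3f}"
    )
```

Jittering changes each point independently, whereas scaling shifts the level of the whole series by one factor, so the two perturb different characteristics of the data.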

Abstract

We introduce the Robustness of Hierarchically Organized Time Series (RHiOTS) framework, designed to assess the robustness of hierarchical time series forecasting models and algorithms on real-world datasets. Hierarchical time series, where lower-level forecasts must sum to upper-level ones, are prevalent in various contexts, such as retail sales across countries. Current empirical evaluations of forecasting methods are often limited to a small set of benchmark datasets, offering a narrow view of algorithm behavior. RHiOTS addresses this gap by systematically altering existing datasets and modifying the characteristics of individual series and their interrelations. It uses a set of parameterizable transformations to simulate those changes in the data distribution. Additionally, RHiOTS incorporates an innovative visualization component, turning complex, multidimensional robustness evaluation results into intuitive, easily interpretable visuals. This approach allows an in-depth analysis of algorithm and model behavior under diverse conditions. We illustrate the use of RHiOTS by analyzing the predictive performance of several algorithms. Our findings show that traditional statistical methods are more robust than state-of-the-art deep learning algorithms, except when the transformation effect is highly disruptive. Furthermore, we found no significant differences in the robustness of the algorithms when applying specific reconciliation methods, such as MinT. RHiOTS provides researchers with a comprehensive tool for understanding the nuanced behavior of forecasting algorithms, offering a more reliable basis for selecting the most appropriate method for a given problem.
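The coherence requirement mentioned in the abstract (lower-level series must sum to upper-level ones) can be sketched with a summing matrix. The hierarchy below (one total, two regions, four leaf series) and the helper names `aggregate` and `is_coherent` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

# Hypothetical hierarchy: total -> 2 regions -> 4 leaf series.
# Each row of S says which leaf series are summed to form that series.
S = np.array([
    [1, 1, 1, 1],  # total
    [1, 1, 0, 0],  # region A
    [0, 0, 1, 1],  # region B
    [1, 0, 0, 0],  # leaf 1
    [0, 1, 0, 0],  # leaf 2
    [0, 0, 1, 0],  # leaf 3
    [0, 0, 0, 1],  # leaf 4
])

def aggregate(leaves):
    """Build every series in the hierarchy from the leaves: (4, T) -> (7, T)."""
    return S @ leaves

def is_coherent(all_series, tol=1e-8):
    """Check that the upper levels equal the sums of the bottom-level rows."""
    n_leaves = S.shape[1]
    bottom = all_series[-n_leaves:]
    return np.allclose(all_series, S @ bottom, atol=tol)

leaves = np.arange(12, dtype=float).reshape(4, 3)  # 4 leaf series, 3 time steps
hierarchy = aggregate(leaves)
print(is_coherent(hierarchy))   # coherent by construction

broken = hierarchy.copy()
broken[0, 0] += 5.0             # change the total without changing the leaves
print(is_coherent(broken))      # the total no longer matches the leaf sum
```

Under this construction, a transformation applied only to the leaf series and followed by re-aggregation keeps the dataset coherent, whereas editing an upper-level series directly does not.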

Paper Structure (11 sections, 1 equation, 5 figures, 1 table)

Figures (5)

  • Figure 1: Simple example of a hierarchically organized time series dataset that comprises sales data from a retailer in the US.
  • Figure 2: Ridge plot showing the distribution of DTW distances between the time series of each dataset (columns), for each transformation (rows) and parameter set. The original DTW distances are shown in orange and the transformed ones in shades of blue; the color gets lighter as the magnitude of the transformation increases.
  • Figure 3: Model performance across various data transformations in hierarchical time series forecasting, assessed using MASE. Each panel represents a different forecasting model subjected to transformations such as jitter, magnitude warping, scaling, and time warping, with the transformation intensity increasing from 'orig' to 'v5'. The lines within each panel correspond to different hierarchical levels of the data, providing insight into the robustness of each model at various granularities.
  • Figure 4: Ranking of the performance of forecasting methods under magnitude warping transformation for the Tourism dataset, from original data ('orig') to the most intense transformation ('v5'). Performance rank is indicated by proximity to the center, with 0 being the best. It shows that the performance of all algorithms deteriorates with increased transformation intensity, highlighted by the significant reordering of ranks and crossing of lines.
  • Figure 5: The chart on the left shows the performance of forecasting algorithms against multiple transformations for the Tourism dataset. The chart on the right averages the performance ranks of forecasting algorithms for all datasets.
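Since Figure 3 reports model performance as MASE, a minimal sketch of the standard MASE definition may help when reading it: the forecast's mean absolute error divided by the in-sample mean absolute error of a (seasonal) naive forecast. The function name, the seasonal period `m`, and the toy numbers are assumptions for this example; the summary does not state which seasonal period the paper uses.

```python
import numpy as np

def mase(y_train, y_true, y_pred, m=1):
    """Mean absolute scaled error.

    The scale is the in-sample MAE of the naive forecast that repeats the
    value observed `m` steps earlier (m=1 is the plain naive forecast).
    """
    y_train = np.asarray(y_train, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

y_train = [1.0, 2.0, 3.0, 4.0, 5.0]   # naive one-step errors are all 1
y_true = [6.0, 7.0]
y_pred = [5.0, 7.0]                   # absolute errors: 1 and 0
print(mase(y_train, y_true, y_pred))  # 0.5
```

Values below 1 mean the forecast is more accurate than the in-sample naive forecast. Because the error is scaled per series, the metric can be compared across hierarchy levels whose series differ greatly in magnitude.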