Multi-layer Stack Ensembles for Time Series Forecasting
Nathanael Bosch, Oleksandr Shchur, Nick Erickson, Michael Bohlke-Schneider, Caner Türkmen
TL;DR
The paper identifies forecast ensembling as a key driver of accuracy in time series forecasting and demonstrates that stacking, especially in a multi-layer framework, yields robust gains across diverse real-world datasets. It introduces a three-layer stacking architecture (L1 base forecasters, L2 stackers, L3 aggregator) and a training protocol based on time-series cross-validation that prevents data leakage. In a large-scale benchmark of 33 ensemble methods on 50 datasets, multi-layer stacking consistently outperforms single-layer approaches and simple averages, while remaining robust to changes in base-model selection. The results have practical implications for AutoML systems in forecasting: adaptive, diverse ensembling strategies can significantly improve accuracy on both point and probabilistic forecasting tasks, albeit at higher computational cost. The work also offers detailed ablations and guidance on data usage, model choices, and retraining strategies to maximize performance while acknowledging resource trade-offs.
Abstract
Ensembling is a powerful technique for improving the accuracy of machine learning models, with methods like stacking achieving strong results in tabular tasks. In time series forecasting, however, ensemble methods remain underutilized, with simple linear combinations still considered state-of-the-art. In this paper, we systematically explore ensembling strategies for time series forecasting. We evaluate 33 ensemble models -- both existing and novel -- across 50 real-world datasets. Our results show that stacking consistently improves accuracy, though no single stacker performs best across all tasks. To address this, we propose a multi-layer stacking framework for time series forecasting, an approach that combines the strengths of different stacker models. We demonstrate that this method consistently provides superior accuracy across diverse forecasting scenarios. Our findings highlight the potential of stacking-based methods to improve AutoML systems for time series forecasting.
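To make the three-layer idea from the TL;DR concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. Everything in it is an illustrative assumption: the three toy base forecasters, the three linear stackers, the inverse-error L3 aggregator, the horizon `H`, the seasonal period `SEASON`, and the function names. It shows only the leakage-avoiding pattern: L1 forecasts are produced on backtest windows whose targets lie strictly after the training slice, the earlier windows fit the L2 stackers, and the last window, which the stackers never saw, fits the L3 aggregator.

```python
import numpy as np

H = 4        # forecast horizon (assumed for illustration)
SEASON = 4   # seasonal period (assumed for illustration)

# --- L1: toy base forecasters (hypothetical stand-ins for real models) ---
def naive(y, h):
    return np.repeat(y[-1], h)

def seasonal_naive(y, h):
    return np.array([y[len(y) - SEASON + (i % SEASON)] for i in range(h)])

def drift(y, h):
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)

BASE = [naive, seasonal_naive, drift]

def l1_forecasts(y, h):
    """Return an (h, n_models) matrix of base forecasts."""
    return np.column_stack([f(y, h) for f in BASE])

def backtest_windows(y, n_windows, h):
    """Yield (train, target) pairs; each target lies strictly after its train slice."""
    for k in range(n_windows, 0, -1):
        cutoff = len(y) - k * h
        yield y[:cutoff], y[cutoff:cutoff + h]

# --- L2: stackers, each returning a weight vector over the L1 models ---
def fit_mean(X, t):
    return np.full(X.shape[1], 1.0 / X.shape[1])

def fit_lstsq(X, t):
    return np.linalg.lstsq(X, t, rcond=None)[0]

def fit_best(X, t):
    w = np.zeros(X.shape[1])
    w[np.argmin(np.abs(X - t[:, None]).mean(axis=0))] = 1.0
    return w

STACKERS = [fit_mean, fit_lstsq, fit_best]

def fit_multilayer(y, n_windows=4, h=H):
    windows = [(l1_forecasts(tr, h), tg)
               for tr, tg in backtest_windows(y, n_windows, h)]
    # Earlier windows train the L2 stackers on out-of-sample L1 forecasts.
    X2 = np.vstack([X for X, _ in windows[:-1]])
    t2 = np.concatenate([t for _, t in windows[:-1]])
    W2 = np.column_stack([fit(X2, t2) for fit in STACKERS])  # (n_models, n_stackers)
    # The last window is unseen by the stackers, so it can train L3.
    X3, t3 = windows[-1]
    Z = X3 @ W2                                   # L2 forecasts, (h, n_stackers)
    mae = np.abs(Z - t3[:, None]).mean(axis=0)
    inv = 1.0 / (mae + 1e-8)
    w3 = inv / inv.sum()                          # L3: inverse-error weights
    return W2, w3

def predict(y, W2, w3, h=H):
    """L1 forecasts -> L2 stacker forecasts -> L3 aggregate."""
    return l1_forecasts(y, h) @ W2 @ w3

# Usage on a synthetic seasonal series with trend.
rng = np.random.default_rng(0)
t_idx = np.arange(60)
y = 0.5 * t_idx + 3.0 * np.sin(2 * np.pi * t_idx / SEASON) + rng.normal(0, 0.3, 60)
W2, w3 = fit_multilayer(y)
forecast = predict(y, W2, w3)
```

Because all three toy stackers are linear, the whole pipeline collapses to a single weight vector `W2 @ w3` over the base models. That is a simplification of this sketch only; nonlinear stackers at L2 would not reduce this way.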
