A More Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models
Kin G. Olivares, Malcolm Wolff, Tatiana Konstantinova, Shankar Ramasubramanian, Boris Oreshkin, Andrew Gordon Wilson, Andres Potapczynski, Willa Potosnak, Michael W. Mahoney, Mengfei Cao, Dmitry Efimov
TL;DR
Cross-Frequency Transfer Learning (CFTL) seeks to scale foundation forecasting models (FFMs) by integrating multi-frequency time series, but existing benchmarks misrepresent performance due to leakage and limited evaluation data. The authors introduce a unified CFTL framework with leakage-free pretraining on proprietary and synthetic data, and evaluate zero-shot transfer on 15 large public datasets using a standardized evaluation pipeline. Their findings show that traditional statistical baselines such as ARIMA and the SiCoUM ensemble frequently outperform neural CFTL models in both probabilistic and point forecasts (by more than $8.2\%$ in $sCRPS$ and more than $20\%$ in $MASE$), while synthetic pretraining yields a modest improvement of roughly $7\%$ for an FFM. These results underscore the need for realistic benchmarking in CFTL and highlight synthetic data and leakage-aware protocols as promising directions for closing the gap between neural and statistical approaches in practical forecasting.
Abstract
Cross-frequency transfer learning (CFTL) has emerged as a popular framework for curating large-scale time series datasets to pre-train foundation forecasting models (FFMs). Although CFTL has shown promise, current benchmarking practices fall short of accurately assessing its performance. This shortcoming stems from many factors: an over-reliance on small-scale evaluation datasets; inadequate treatment of sample size when computing summary statistics; reporting of suboptimal statistical models; and failing to account for non-negligible risks of overlap between pre-training and test datasets. To address these limitations, we introduce a unified reimplementation of widely adopted neural forecasting networks, adapting them for the CFTL setup; we pre-train only on proprietary and synthetic data, being careful to prevent test leakage; and we evaluate on 15 large, diverse public forecast competition datasets. Our empirical analysis reveals that statistical models' accuracy is frequently underreported. Notably, we confirm that statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS, and by more than 20% in MASE, across datasets. However, we also find that synthetic dataset pre-training does improve the accuracy of an FFM by 7%.
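
As a rough illustration of the point-forecast metric and the sample-size concern raised above, the following is a minimal sketch, not the authors' evaluation pipeline. It assumes the standard definition of MASE (test-set mean absolute error scaled by the in-sample error of a seasonal naive forecast) and assumes that per-dataset scores are weighted by the number of series in each dataset; the helper names `mase` and `summarize` are hypothetical.

```python
import numpy as np


def mase(y_train, y_test, y_pred, season=1):
    """MASE for a single series (standard definition, assumed here).

    The scale is the in-sample mean absolute error of a seasonal
    naive forecast with lag `season`.
    """
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return float(np.mean(np.abs(y_test - y_pred)) / scale)


def summarize(scores, n_series):
    """Compare an unweighted mean of per-dataset scores with a mean
    weighted by dataset size (weighting scheme is an assumption)."""
    scores = np.asarray(scores, dtype=float)
    n_series = np.asarray(n_series, dtype=float)
    return {
        "unweighted": float(scores.mean()),
        "size_weighted": float(np.average(scores, weights=n_series)),
    }
```

When dataset sizes differ widely, the two summaries can diverge noticeably, which is one way an aggregate that ignores sample size may misstate overall accuracy.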
