A More Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models

Kin G. Olivares, Malcolm Wolff, Tatiana Konstantinova, Shankar Ramasubramanian, Boris Oreshkin, Andrew Gordon Wilson, Andres Potapczynski, Willa Potosnak, Michael W. Mahoney, Mengfei Cao, Dmitry Efimov

TL;DR

Cross-Frequency Transfer Learning (CFTL) seeks to scale foundation forecasting models by integrating multi-frequency time series, but existing benchmarks misrepresent performance due to leakage and limited data. The authors introduce a unified CFTL framework with leakage-free pretraining on proprietary and synthetic data, and evaluate zero-shot transfer on 15 large public datasets using a standardized evaluation pipeline. Their findings show traditional statistical baselines such as ARIMA and the SiCoUM ensemble frequently outperform neural CFTL models in both probabilistic ($sCRPS$) and point ($MASE$) forecasts, while synthetic pretraining can yield modest improvements for FFMs (roughly $7\%$ in $sCRPS$ and $20\%$ in $MASE$). These results underscore the need for realistic benchmarking in CFTL and highlight synthetic data and leakage-aware protocols as promising directions for closing the gap between neural and statistical approaches in practical forecasting.

Abstract

Cross-frequency transfer learning (CFTL) has emerged as a popular framework for curating large-scale time series datasets to pre-train foundation forecasting models (FFMs). Although CFTL has shown promise, current benchmarking practices fall short of accurately assessing its performance. This shortcoming stems from many factors: an over-reliance on small-scale evaluation datasets; inadequate treatment of sample size when computing summary statistics; reporting of suboptimal statistical models; and failure to account for non-negligible risks of overlap between pre-training and test datasets. To address these limitations, we introduce a unified reimplementation of widely adopted neural forecasting networks, adapting them for the CFTL setup; we pre-train only on proprietary and synthetic data, being careful to prevent test leakage; and we evaluate on 15 large, diverse public forecast competition datasets. Our empirical analysis reveals that statistical models' accuracy is frequently underreported. Notably, we confirm that statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS, and by more than 20% in MASE, across datasets. However, we also find that synthetic dataset pre-training does improve the accuracy of an FFM by 7%.
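The abstract reports results in MASE and sCRPS but this section does not define them. The following is a minimal sketch using common textbook definitions, which may differ from the paper's exact formulas: MASE divides the forecast mean absolute error by the in-sample mean absolute error of a (seasonal) naive forecast, and sCRPS is approximated here by twice the average pinball loss over a grid of quantile levels, scaled by the sum of absolute target values. The function names, argument names, and the quantile-grid approximation are assumptions for illustration.

```python
import numpy as np


def mase(y_train, y_true, y_pred, season=1):
    """Forecast MAE divided by the in-sample MAE of a seasonal-naive forecast.

    Assumed (common) definition; `season=1` gives the plain naive scaling.
    """
    y_train = np.asarray(y_train, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / scale


def scaled_crps(y_true, q_pred, levels):
    """Quantile-grid approximation of CRPS, scaled by sum(|y_true|).

    q_pred has shape (len(levels), horizon): one forecast per quantile level.
    Assumed approximation: CRPS ~= 2 * mean pinball loss over the levels.
    """
    y_true = np.asarray(y_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    levels = np.asarray(levels, dtype=float)[:, None]
    diff = y_true[None, :] - q_pred
    pinball = np.maximum(levels * diff, (levels - 1.0) * diff)
    return 2.0 * pinball.mean(axis=0).sum() / np.abs(y_true).sum()


if __name__ == "__main__":
    print(mase([1, 2, 3, 4, 5], [6, 7], [5, 9]))
    print(scaled_crps([10, 10], [[8, 12]], [0.5]))
```

With only the median supplied, the sCRPS approximation reduces to the sum of absolute errors divided by the sum of absolute targets, which is a quick sanity check on the scaling.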

Paper Structure

This paper contains 26 sections, 8 equations, 4 figures, and 8 tables.

Figures (4)

  • Figure 1: Naively padding and combining series of different frequencies to train global models leads to two challenges: (a) unbalanced numbers of observations across series of different frequencies, which saturate learning signals and induce inverse frequency aliasing effects; and (b) heterogeneous time series scales, which bias gradient optimization. These unresolved challenges still prevent FFMs from replacing statistical models and neural forecasting models specialized for each frequency.
  • Figure 2: Three-layer fully connected network predictive function. Classic forecasting applications optimize distinct model parameters for the source $D^{(S)}$ and target $D^{(T)}$ datasets (columns a and b). Parameter-based transfer learning leverages source dataset knowledge by using a pre-trained model's parameters $\boldsymbol{\theta}^{(S)}_{l}$ to initialize another model's parameters $\boldsymbol{\theta}^{(T)}_{l}$, which can then specialize on a target dataset.
  • Figure 3: For our CFTL task, we use two data sources: (a) a set of real-world datasets of large-scale online retail demand; and (b) a set of synthetic datasets composed of Gaussian processes, Fourier harmonic signals, and polynomial trends.
  • Figure 4: Pre-training datasets ablation, with and without the use of synthetic data. Shown are metrics with (red) and without (green) synthetic data for pre-training, for the NBEATS model.
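
The Figure 3 caption names three ingredients of the synthetic pre-training data: Gaussian processes, Fourier harmonic signals, and polynomial trends. As a rough illustration only, the sketch below draws one series by summing one component of each kind. How the paper actually combines or parameterizes these components is not stated in this section, so the additive combination, the RBF kernel, and every parameter name and default here are assumptions.

```python
import numpy as np


def synthetic_series(length, rng, n_harmonics=3, degree=2, gp_length_scale=10.0):
    """Draw one synthetic series: polynomial trend + Fourier harmonics + GP sample.

    Hypothetical generator; the additive form and all defaults are assumptions.
    """
    t = np.linspace(0.0, 1.0, length)

    # Polynomial trend with random coefficients.
    coefs = rng.normal(size=degree + 1)
    trend = np.polyval(coefs, t)

    # Sum of sinusoids with random amplitudes and phases.
    k = np.arange(1, n_harmonics + 1)[:, None]
    amp = rng.normal(size=(n_harmonics, 1))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(n_harmonics, 1))
    seasonal = (amp * np.sin(2.0 * np.pi * k * t[None, :] + phase)).sum(axis=0)

    # Gaussian process sample with an RBF kernel; jitter keeps the covariance
    # numerically positive definite.
    idx = np.arange(length)
    dist = idx[:, None] - idx[None, :]
    cov = np.exp(-0.5 * (dist / gp_length_scale) ** 2) + 1e-6 * np.eye(length)
    gp = rng.multivariate_normal(np.zeros(length), cov)

    return trend + seasonal + gp


if __name__ == "__main__":
    series = synthetic_series(64, np.random.default_rng(0))
    print(series.shape)
```

Passing a seeded `numpy.random.Generator` makes each draw reproducible, which matters when a synthetic corpus has to be regenerated for an ablation like the one in Figure 4.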