There are no Champions in Supervised Long-Term Time Series Forecasting

Lorenzo Brigato, Rafael Morand, Knut Strømmen, Maria Panagiotou, Markus Schmidt, Stavroula Mougiakakou

TL;DR

This paper argues that supervised long-term time series forecasting has no reliable champion: under standardized evaluation, no model consistently dominates across diverse benchmarks. It conducts a broad, reproducible study of eight models across 14 datasets with extensive hyperparameter (HP) searches, revealing that rankings are highly sensitive to experimental setup. The authors demonstrate that dataset choice, horizon selection, baseline inclusion, HP tuning, visualization practices, and statistical testing all shape reported progress, often more than architectural advances do. They conclude with concrete recommendations for improving benchmarking practices and for making claims transparent and reproducible.

Abstract

Recent advances in long-term time series forecasting have introduced numerous complex supervised prediction models that consistently outperform previously published architectures. However, this rapid progression raises concerns regarding inconsistent benchmarking and reporting practices, which may undermine the reliability of these comparisons. In this study, we first perform a broad, thorough, and reproducible evaluation of the top-performing supervised models on the most popular benchmark and additional baselines representing the most active architecture families. This extensive evaluation assesses eight models on 14 datasets, encompassing $\sim$5,000 trained networks for the hyperparameter (HP) searches. Then, through a comprehensive analysis, we find that slight changes to experimental setups or current evaluation metrics drastically shift the common belief that newly published results are advancing the state of the art. Our findings emphasize the need to shift focus away from pursuing ever-more complex models, towards enhancing benchmarking practices through rigorous and standardized evaluations that enable more substantiated claims, including reproducible HP setups and statistical testing. We offer recommendations for future research.

Paper Structure

This paper contains 34 sections, 4 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: There is no champion. The relative MSE averaged over all forecast horizons reveals that no model dominates on all datasets.
  • Figure 2: Potential lack in dataset diversity. The benchmarks do not span a wide range of frequencies and number of variates across domains.
  • Figure 3: Uni- vs. multivariate. PatchTST and iTransformer perform comparably in terms of explained variance.
  • Figure 4: Model rankings are highly sensitive to dataset and horizon selection. We assess the robustness of rankings across 5,000 experimental configurations, each using a random subset of datasets and forecast horizons. Including MotorImagery, the only dataset with clear model gaps (\ref{fig:all_ds_violin}, \ref{sec:full_results}), favors S-Mamba, while excluding it yields close performance across models. This highlights the brittleness of current benchmarks, where small changes in datasets or forecast horizons can easily shift which model appears as a champion. Best and second-best are highlighted.
  • Figure 5: Bias in visualizations. The plots show the same results (MSE) represented at two scales. The relative scale makes performance differences between models appear more subtle.
  • ...and 3 more figures
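
The ranking-sensitivity check described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `champion_frequency`, the way subsets are sampled, the normalization of MSE by the best model per dataset and horizon, and the synthetic MSE values are all assumptions made for illustration. The sketch draws random subsets of datasets and forecast horizons and counts how often each model comes out on top.

```python
import numpy as np


def champion_frequency(mse, n_configs=5000, seed=0):
    """Fraction of random configurations in which each model ranks first.

    mse: array of shape (n_models, n_datasets, n_horizons).
    Hypothetical helper; the sampling scheme below is an assumption.
    """
    rng = np.random.default_rng(seed)
    n_models, n_datasets, n_horizons = mse.shape
    # Assumed normalization: divide each (dataset, horizon) cell by the
    # best model's MSE there, so datasets with large errors do not dominate.
    rel = mse / mse.min(axis=0, keepdims=True)
    wins = np.zeros(n_models)
    for _ in range(n_configs):
        # Draw a random non-empty subset of datasets and of horizons.
        k_d = rng.integers(1, n_datasets + 1)
        k_h = rng.integers(1, n_horizons + 1)
        ds = rng.choice(n_datasets, size=k_d, replace=False)
        hz = rng.choice(n_horizons, size=k_h, replace=False)
        # Average relative MSE over the selected datasets and horizons.
        score = rel[:, ds][:, :, hz].mean(axis=(1, 2))
        wins[np.argmin(score)] += 1
    return wins / n_configs


# Synthetic placeholder values, NOT results from the paper:
# 8 models, 14 datasets, and an assumed 4 forecast horizons.
demo_rng = np.random.default_rng(42)
demo_mse = demo_rng.uniform(0.2, 0.5, size=(8, 14, 4))
print(champion_frequency(demo_mse, n_configs=1000))
```

When the models are close, as in the synthetic data here, the winning model changes from one configuration to the next, which is the brittleness the caption describes. A model that is strictly best in every cell would win every configuration.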