fev-bench: A Realistic Benchmark for Time Series Forecasting

Oleksandr Shchur, Abdul Fatir Ansari, Caner Turkmen, Lorenzo Stella, Nick Erickson, Pablo Guerron, Michael Bohlke-Schneider, Yuyang Wang

TL;DR

fev-bench addresses core gaps in time series forecasting benchmarks by delivering covariate-aware, multivariate evaluation across 100 tasks with principled, bootstrap-based aggregation. The accompanying fev library enables reproducible, lightweight benchmarking that integrates with popular forecasting stacks via adapters and YAML-defined tasks. Central contributions include a robust task design (covering horizons, covariates, and domains), two complementary aggregation statistics (marginal and pairwise) with confidence intervals, and infrastructure that supports community-driven benchmark evolution. Empirical results show strong performance of recent pretrained models, with covariates and multivariate capabilities highlighted as key areas for future improvement and research impact in real-world forecasting problems.

Abstract

Benchmark quality is critical for meaningful evaluation and sustained progress in time series forecasting, particularly given the recent rise of pretrained models. Existing benchmarks often have narrow domain coverage or overlook important real-world settings, such as tasks with covariates. Additionally, their aggregation procedures often lack statistical rigor, making it unclear whether observed performance differences reflect true improvements or random variation. Many benchmarks also fail to provide infrastructure for consistent evaluation or are too rigid to integrate into existing pipelines. To address these gaps, we propose fev-bench, a benchmark comprising 100 forecasting tasks across seven domains, including 46 tasks with covariates. Supporting the benchmark, we introduce fev, a lightweight Python library for benchmarking forecasting models that emphasizes reproducibility and seamless integration with existing workflows. Using fev, fev-bench employs principled aggregation methods with bootstrapped confidence intervals to report model performance along two complementary dimensions: win rates and skill scores. We report results on fev-bench for various pretrained, statistical and baseline models, and identify promising directions for future research.
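To make the aggregation idea concrete, the following is a minimal sketch of win rates with bootstrapped confidence intervals. It is not the fev implementation. It assumes that a pairwise win rate is the fraction of tasks on which one model has lower error than another (with ties counted as half a win), that the average win rate is the mean of the pairwise win rates against all other models, and that the bootstrap resamples tasks with replacement. The function names and the `errors` array layout (models by tasks) are hypothetical.

```python
import numpy as np


def pairwise_win_rate(errors_j, errors_k):
    """Fraction of tasks where model j beats model k (lower error wins).

    Assumption: ties count as half a win for each model.
    """
    errors_j = np.asarray(errors_j, dtype=float)
    errors_k = np.asarray(errors_k, dtype=float)
    wins = (errors_j < errors_k).astype(float)
    ties = (errors_j == errors_k).astype(float)
    return float(np.mean(wins + 0.5 * ties))


def average_win_rate(errors, j):
    """Mean pairwise win rate of model j against every other model.

    errors has shape (M, R): M models evaluated on the same R tasks.
    """
    errors = np.asarray(errors, dtype=float)
    others = [k for k in range(errors.shape[0]) if k != j]
    return float(np.mean([pairwise_win_rate(errors[j], errors[k]) for k in others]))


def bootstrap_ci(errors, j, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the average win rate of model j.

    Assumption: tasks are resampled with replacement, and all models are
    evaluated on the same resampled set of tasks.
    """
    errors = np.asarray(errors, dtype=float)
    rng = np.random.default_rng(seed)
    n_tasks = errors.shape[1]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_tasks, size=n_tasks)
        stats[b] = average_win_rate(errors[:, idx], j)
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


if __name__ == "__main__":
    # Toy error matrix: 3 models, 4 tasks (values are illustrative only).
    toy_errors = [[1, 2, 3, 4], [2, 2, 4, 3], [5, 5, 5, 5]]
    print(average_win_rate(toy_errors, 0))
    print(bootstrap_ci(toy_errors, 0))
```

Heavily overlapping intervals for two models would then suggest that the observed difference in win rate may be due to the particular choice of tasks rather than a true improvement.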

Paper Structure

This paper contains 38 sections, 1 theorem, 19 equations, 7 figures, 17 tables.

Key Result

Proposition 3.1

Suppose all $M$ models are compared on the same $R$ tasks, with pairwise win rates $W_{jk}$ (eq:pairwise-winrate) and average win rates $W_j=\tfrac{1}{M-1}\sum_{k\neq j} W_{jk}$ (eq:avg-winrate). At the BT MLE with scale $\lambda>0$,

Figures (7)

  • Figure 1: Pairwise win rates (a) and skill scores (b) of the top-3 models against other models under the SQL metric on fev-bench, with 95% confidence intervals obtained via bootstrapping. Higher values are better. The confidence intervals show heavy overlap between TiRex and TimesFM-2.5, suggesting no clear winner between the two, while both outperform the remaining models. Full pairwise results are available in \ref{app:extra-results}. Best viewed on screen.
  • Figure 2: Average skill scores on the 42 fev-bench tasks with dynamic covariates (based on SQL). TabPFN-TS, the only model that uses known covariates, outperforms all others, indicating that pretrained models miss valuable predictive signal from covariates.
  • Figure 3: Average skill scores relative to the baseline on 35 multivariate tasks of fev-bench (based on SQL). Toto-1.0, the only multivariate model, outperforms others despite ranking third overall in \ref{tab:lb-sql-subset}, indicating room for improvement on multivariate forecasting.
  • Figure 4: Pairwise win rates $W_{jk}$ (\ref{eq:pairwise-winrate}) of all models against each other under the scaled quantile loss (SQL) metric on fev-bench, with 95% confidence intervals obtained via bootstrapping. Higher values are better.
  • Figure 5: Pairwise skill scores $S_{jk}$ (\ref{eq:pairwise-skillscore}) of all models against each other under the scaled quantile loss (SQL) metric on fev-bench, with 95% confidence intervals obtained via bootstrapping. Higher values are better. Note that pairwise skill score is not symmetric, $S_{jk} \ne S_{kj}$.
  • ...and 2 more figures
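The asymmetry noted in the Figure 5 caption can be illustrated with a small sketch. The exact definition of $S_{jk}$ is not reproduced in this excerpt, so the code below assumes a common form: one minus the geometric mean of per-task error ratios, with each ratio clipped to a fixed range. The function name `pairwise_skill_score` and the clipping bounds are assumptions for illustration only.

```python
import numpy as np


def pairwise_skill_score(errors_j, errors_k, clip=(1e-2, 100.0)):
    """Skill score of model j relative to model k (higher is better).

    Assumed form: 1 - geometric mean over tasks of errors_j / errors_k,
    with each per-task ratio clipped to `clip` (bounds are an assumption).
    """
    errors_j = np.asarray(errors_j, dtype=float)
    errors_k = np.asarray(errors_k, dtype=float)
    ratios = np.clip(errors_j / errors_k, clip[0], clip[1])
    geometric_mean = np.exp(np.mean(np.log(ratios)))
    return float(1.0 - geometric_mean)


if __name__ == "__main__":
    # Illustrative errors only: model j has half the error of model k on each task.
    err_j = [1.0, 2.0]
    err_k = [2.0, 4.0]
    print(pairwise_skill_score(err_j, err_k))  # j relative to k
    print(pairwise_skill_score(err_k, err_j))  # k relative to j
```

Under this assumed form, halving the error gives a skill score of about 0.5, while the reverse comparison gives about -1.0, so the two directions are not mirror images of each other.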

Theorems & Definitions (2)

  • Proposition 3.1
  • proof