It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

Zhongzheng Qiao, Sheng Pan, Anni Wang, Viktoriya Zhukova, Yong Liu, Xudong Jiang, Qingsong Wen, Mingsheng Long, Ming Jin, Chenghao Liu

TL;DR

TIME presents a next-generation TSF benchmark that shifts emphasis from dataset-centric to task-centric evaluation, addressing data freshness, integrity, task realism, and analysis depth. It combines a human-in-the-loop data pipeline with pattern-level analysis using interpretable time-series features, enabling cross-dataset diagnostics and actionable insights. The benchmark comprises 50 fresh datasets and 98 forecasting tasks evaluated with 12 TSFMs, using a rolling protocol and normalization against a Seasonal Naive baseline, organized on a multi-granular leaderboard. Empirical results show that recent TSFMs outperform baselines and that pattern-aware analyses reveal nuanced model strengths across trend, seasonality, stationarity, and complexity. The work advances practical benchmarking by coupling quantitative metrics with qualitative visualization to bridge benchmark performance and real-world forecasting utility.
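
The rolling protocol and Seasonal Naive normalization mentioned above can be illustrated with a minimal sketch. The function names, the window layout (non-overlapping windows ending at the end of the series), and the use of MAE as the error measure are assumptions made for illustration only; this summary does not spell out the benchmark's exact metrics or window configuration.

```python
from statistics import mean


def seasonal_naive(history, horizon, season):
    """Forecast by repeating the last full season of `history`."""
    last_season = history[-season:]
    return [last_season[i % season] for i in range(horizon)]


def mae(actual, forecast):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - f) for a, f in zip(actual, forecast))


def rolling_windows(series, horizon, n_windows):
    """Yield (history, target) pairs; the last window ends at the series end."""
    for w in range(n_windows):
        end = len(series) - (n_windows - 1 - w) * horizon
        yield series[: end - horizon], series[end - horizon : end]


def normalized_score(series, model, horizon, season, n_windows):
    """Model error divided by Seasonal Naive error over the same windows.

    `model` is a hypothetical callable (history, horizon) -> list of forecasts.
    A score below 1.0 means the model beats the Seasonal Naive baseline.
    """
    model_err, base_err = [], []
    for history, target in rolling_windows(series, horizon, n_windows):
        model_err.append(mae(target, model(history, horizon)))
        base_err.append(mae(target, seasonal_naive(history, horizon, season)))
    return mean(model_err) / mean(base_err)


# Toy series: period-4 seasonality plus a linear trend.
series = [(t % 4) + 0.1 * t for t in range(24)]


def last_value(history, horizon):
    """A deliberately simple stand-in model: repeat the last observation."""
    return [history[-1]] * horizon


score = normalized_score(series, last_value, horizon=4, season=4, n_windows=3)
```

Because every model is divided by the same baseline error, scores become comparable across tasks whose raw error scales differ.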

Abstract

Time series foundation models (TSFMs) are shifting the forecasting landscape from dataset-specific modeling to generalizable task evaluation. However, we contend that existing benchmarks exhibit common limitations in four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous quality assurance, misaligned task formulations detached from real-world contexts, and rigid analysis perspectives that obscure generalizable insights. To bridge these gaps, we introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks, tailored for strict zero-shot TSFM evaluation free from data leakage. Integrating large language models and human expertise, we establish a rigorous human-in-the-loop benchmark construction pipeline to ensure high data integrity and redefine task formulation by aligning forecasting configurations with real-world operational requirements and variate predictability. Furthermore, we propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels. By leveraging structural time series features to characterize intrinsic temporal properties, this approach offers generalizable insights into model capabilities across diverse patterns. We evaluate 12 representative TSFMs and establish a multi-granular leaderboard to facilitate in-depth analysis and visualized inspection. The leaderboard is available at https://huggingface.co/spaces/Real-TSF/TIME-leaderboard.

Paper Structure (106 sections, 15 equations, 13 figures, 5 tables, 2 algorithms)

Figures (13)

  • Figure 1: Common bottlenecks in prevalent TSF benchmarks.
  • Figure 2: Timeline of time series forecasting (TSF) benchmarks. Each block represents a TSF benchmark, where color indicates the data source and outline style represents the domain diversity. Superscripts mark the prior benchmarks whose datasets are reused in each work. Major milestones and transitions in the field are highlighted in red along the timeline. TST denotes Time Series Transformer.
  • Figure 3: The overall workflow of TIME. (1) Fresh datasets are curated from diverse sources, integrating automated quality screening with LLM-assisted human decision-making to formulate context-aware forecasting tasks. (2) The intrinsic pattern of each variate is characterized by structural time series features derived via STL decomposition and mapped to binary encodings. (3) Representative TSFMs undergo rolling-window evaluations for probabilistic forecasting, where quantile-based metrics are computed and recorded for each window. (4) Performance analysis is conducted through pattern-targeted retrieval and multi-granular leaderboards.
  • Figure 4: Overall performance across all tasks. Task-level results are normalized by the Seasonal Naive baseline and aggregated using the geometric mean. Lower values (bottom-left corner) indicate better performance. Models annotated with $\dagger$ support multivariate modeling, whereas others operate under channel independence.
  • Figure 5: Comparison of MASE across different feature-specific variates. Each row represents a model's performance on variates with $F_k = 1$ and $F_k = 0$. The distance between the two dots indicates how sensitive the model's performance is to that specific feature.
  • ...and 8 more figures
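
The aggregation described in the Figure 4 caption (task-level results normalized by the Seasonal Naive baseline, then combined with the geometric mean) can be sketched as follows. The function names and the dictionary layout keyed by task name are hypothetical, chosen only for illustration; the benchmark's actual implementation is not shown in this summary.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one value, all > 0")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def aggregate(model_scores, baseline_scores):
    """Normalize each task score by the baseline, then take the geometric mean.

    Both arguments map a task name to an error value (lower is better).
    """
    ratios = [model_scores[task] / baseline_scores[task] for task in baseline_scores]
    return geometric_mean(ratios)


# Toy example: the model halves the error on one task and doubles it on another.
model_scores = {"task_a": 1.0, "task_b": 4.0}
baseline_scores = {"task_a": 2.0, "task_b": 2.0}
overall = aggregate(model_scores, baseline_scores)
```

In this toy example the ratios are 0.5 and 2.0. Their arithmetic mean is 1.25, which would suggest the model is worse than the baseline overall, whereas the geometric mean is 1.0, treating a halving and a doubling of the error symmetrically. That symmetry is a common reason to prefer the geometric mean when averaging ratios.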