
QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu

Abstract

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce \textsc{QuitoBench}, a regime-balanced benchmark for time series forecasting with coverage across eight trend$\times$seasonality$\times$forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon \textsc{Quito}, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context ($L=96$) but foundation models dominate at long context ($L \ge 576$); (ii) forecastability is the dominant difficulty driver, producing a $3.64 \times$ MAE gap across regimes; (iii) deep learning models match or surpass foundation models at $59 \times$ fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.


Paper Structure

This paper contains 42 sections, 8 equations, 14 figures, 32 tables.

Figures (14)

  • Figure 1: Benchmark contribution rate across domains. Time series has the lowest share (4.2%) vs. NLP (9.9%), speech (7.0%), and vision (6.8%). See \ref{app:arxiv}.
  • Figure 2: 8-grid TSF regime classification across three benchmarks. (\ref{fig:gifteval_grid}) GIFT-Eval is highly imbalanced, with 50.7% of series in a single low-structure regime. (\ref{fig:timer_grid}) Timer concentrates 65.8% in the high-seasonality, high-forecastability regime. (\ref{fig:quito_grid}) Quito distributes series near-uniformly (${\sim}12\%$ per regime).
  • Figure 3: Overall pipeline of Quito and QuitoBench. It contains five key stages: (1) Raw collection, (2) sanitization and standardization, (3) leakage-free temporal splitting, (4) trend/seasonality/forecastability computation and regime labeling, and (5) balanced QuitoBench construction and evaluation.
  • Figure 4: Scaling behavior on Quito for CrossFormer (deep learning) and TimesFM-2.5 (foundation model). More data yields far larger gains than more parameters for both models.
  • Figure 5: Efficiency frontier: mean rank vs. model scale. Deep learning models (blue, 0.3--5 M) match or beat foundation models (red, 30--200 M) at $58{\times}$ fewer parameters.
  • ...and 9 more figures
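The trend$\times$seasonality$\times$forecastability regime labeling behind Figure 2 can be sketched in code. The following is a minimal, hypothetical illustration, not the paper's exact procedure: it assumes a linear-fit trend-strength estimator, a mean-seasonal-profile seasonality estimator, one minus normalized spectral entropy as the forecastability score, and 0.5 thresholds for binarizing each axis into the $2^3 = 8$ regimes.

```python
import numpy as np

def regime_label(series, period=24,
                 trend_thr=0.5, season_thr=0.5, fcast_thr=0.5):
    """Assign a (trend, seasonality, forecastability) regime bit-triple.
    Illustrative sketch only: estimators and thresholds are assumptions."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    t = np.arange(n)

    # Trend strength: variance explained by a linear fit on the time index.
    slope, intercept = np.polyfit(t, x, 1)
    resid = x - (slope * t + intercept)
    trend_strength = max(0.0, 1 - resid.var() / x.var()) if x.var() > 0 else 0.0

    # Seasonal strength: variance explained by the mean seasonal profile
    # of the detrended series, folded at the given period.
    m = (n // period) * period
    season_strength = 0.0
    if m >= 2 * period:
        folds = resid[:m].reshape(-1, period)
        if folds.var() > 0:
            season_resid = folds - folds.mean(axis=0)
            season_strength = max(0.0, 1 - season_resid.var() / folds.var())

    # Forecastability: 1 - normalized spectral entropy (peaked spectrum
    # -> low entropy -> high forecastability; flat noise spectrum -> low).
    spec = np.abs(np.fft.rfft(x - x.mean())) ** 2
    p = spec / spec.sum() if spec.sum() > 0 else np.full(len(spec), 1 / len(spec))
    entropy = -np.sum(p * np.log(p + 1e-12)) / np.log(len(p))
    forecastability = 1 - entropy

    # Binarize each axis to obtain one of the 8 TSF regimes.
    return (int(trend_strength > trend_thr),
            int(season_strength > season_thr),
            int(forecastability > fcast_thr))
```

For example, a pure sinusoid with period 24 would land in the no-trend, high-seasonality, high-forecastability cell, i.e. `regime_label(np.sin(2*np.pi*np.arange(240)/24))` returns `(0, 1, 1)` under these assumed estimators.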

Theorems & Definitions (1)

  • Definition 1: TSF Regime