Revisiting the Generic Transformer: Deconstructing a Strong Baseline for Time Series Foundation Models

Yunshi Wen, Wesley M. Gifford, Chandra Reddy, Lam M. Nguyen, Jayant Kalagnanam, Anak Agung Julius

TL;DR

The paper interrogates the current TSFM landscape by controlling for training data and protocols, showing that a standard Patch Transformer with CPM, mask-aware normalization, and a quantile head can achieve state-of-the-art zero-shot probabilistic forecasting on GIFT-Eval. Through extensive ablations, it demonstrates that pretraining data composition and training recipe are the primary drivers of performance, rather than architectural novelty alone. The authors release open-source checkpoints and pipelines to establish a transparent, reproducible baseline and argue for standardized pretraining corpora and benchmarking to fairly assess architectural contributions. The work emphasizes data diversity and scalable training as crucial factors for real-world TSFM success, while inviting the community to separate architectural progress from data-driven gains in future evaluations.

Abstract

The recent surge in Time Series Foundation Models has rapidly advanced the field, yet the heterogeneous training setups across studies make it difficult to attribute improvements to architectural innovations versus data engineering. In this work, we investigate the potential of a standard patch Transformer, demonstrating that this generic architecture achieves state-of-the-art zero-shot forecasting performance using a straightforward training protocol. We conduct a comprehensive ablation study that covers model scaling, data composition, and training techniques to isolate the essential ingredients for high performance. Our findings identify the key drivers of performance, while confirming that the generic architecture itself demonstrates excellent scalability. By strictly controlling these variables, we provide comprehensive empirical results on model scaling across multiple dimensions. We release our open-source model and detailed findings to establish a transparent, reproducible baseline for future research.

Paper Structure

This paper contains 42 sections, 6 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: The Landscape of Time Series Foundation Models on Probabilistic Forecasting. We compare our generic Transformer with SOTA TSFMs on the GIFT-Eval benchmark. The results establish a strong baseline for neural scaling in TSFMs and reveal that scaling in pretraining data is the primary driver of performance.
  • Figure 2: An overview of the generic Transformer architecture.
  • Figure 3: Forecasting performance on the GIFT-Eval benchmark. The three metrics (MASE, CRPS, Rank) are aggregated with geometric mean over the 97 test cases. Lower values indicate better performance. The methods are sorted based on Aggregated Rank.
  • Figure 4: Impact of model scaling on forecasting performance: We compare increasing model capacity via embedding dimension (width) versus layer count (depth).
  • Figure 5: Impact of training scaling on forecasting performance: We evaluate the intermediate model checkpoints during training. MASE and CRPS for both Pretrain and Zero-shot variants exhibit a consistent monotonic improvement.
  • ...and 9 more figures
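
The Figure 3 caption says each metric is aggregated with a geometric mean over the test cases. The sketch below illustrates that aggregation step only. It is a minimal example under stated assumptions, not the benchmark's or the authors' evaluation code: the function names `geometric_mean` and `aggregate_scores` and the sample scores are hypothetical, and any normalization of the per-case scores before aggregation is assumed to have already happened.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    # exp(mean(log(v))) avoids overflow/underflow from a long running product.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def aggregate_scores(per_case_scores):
    """Aggregate each metric across test cases with the geometric mean.

    per_case_scores maps a metric name to a list of per-test-case scores.
    """
    return {
        metric: geometric_mean(scores)
        for metric, scores in per_case_scores.items()
    }


# Hypothetical scores for three test cases (the caption refers to 97).
example = {
    "MASE": [0.5, 2.0, 1.0],
    "CRPS": [0.1, 0.4, 0.2],
}
result = aggregate_scores(example)
print(result)
```

Compared with an arithmetic mean, the geometric mean weights relative changes equally across test cases, so a single test case with a large absolute error scale does not dominate the aggregate.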