A decoder-only foundation model for time-series forecasting

Abhimanyu Das; Weihao Kong; Rajat Sen; Yichen Zhou

TL;DR

This work presents TimesFM, a decoder-only, patch-based transformer designed as a time-series forecasting foundation model trained from scratch on a large, diverse mix of real and synthetic data. It demonstrates strong zero-shot forecasting across unseen datasets and granularities, achieving near state-of-the-art accuracy without dataset-specific fine-tuning. Key innovations include patch-based input/output handling, longer output patches for efficient horizon forecasting, and masking strategies to cover varying context lengths. The model’s large-scale pretraining, empirical validations, and planned open-release position it as a practical, general-purpose forecaster with broad real-world impact, while acknowledging ethical considerations and avenues for future enhancement (probabilistic forecasts, covariates, and finetuning).
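The patching idea can be illustrated with a minimal sketch. The patch lengths (32 for input, 128 for output), the zero left-padding, and the helper names `make_input_patches` and `decoding_steps` are assumptions chosen for illustration; none of them are specified in this summary. The sketch splits a context window into non-overlapping input patches and counts how many autoregressive decoding steps a horizon needs when each step emits one output patch.

```python
import math
import numpy as np

def make_input_patches(series, input_patch_len):
    """Split a 1-D series into non-overlapping patches.

    Assumption: the series is left-padded with zeros so that its length
    becomes a multiple of input_patch_len. Returns (patches, mask), where
    mask is 1.0 at padded positions and 0.0 at real observations.
    """
    series = np.asarray(series, dtype=float)
    n_patches = math.ceil(len(series) / input_patch_len)
    pad = n_patches * input_patch_len - len(series)
    padded = np.concatenate([np.zeros(pad), series])
    mask = np.concatenate([np.ones(pad), np.zeros(len(series))])
    shape = (n_patches, input_patch_len)
    return padded.reshape(shape), mask.reshape(shape)

def decoding_steps(horizon, output_patch_len):
    """Number of autoregressive steps if each step emits one output patch."""
    return math.ceil(horizon / output_patch_len)

# Hypothetical context of 100 points, patched with an assumed length of 32.
context = np.arange(100.0)
patches, mask = make_input_patches(context, input_patch_len=32)
print(patches.shape)    # (4, 32)
print(int(mask.sum()))  # 28 padded positions

# A longer output patch means fewer decoding steps for the same horizon.
print(decoding_steps(512, 32), decoding_steps(512, 128))  # 16 4
```

Under these assumed settings, a 512-step horizon takes 4 decoding steps with an output patch of 128 rather than 16 steps with an output patch of 32, which is the efficiency argument behind longer output patches.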

Abstract

Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.

Paper Structure (23 sections, 6 equations, 9 figures, 6 tables)

Introduction
Related Work
Problem Definition
Model Architecture
Pretraining Details
Empirical Results
Zero-shot Evaluation
Ablation
Conclusion
Impact Statement
Appendix
Limitations and Future Work
Metrics
Finetuning study on ETT
Pretraining PatchTST
...and 8 more sections

Figures (9)

Figure 1: We provide an illustration of the TimesFM model architecture during training, where we show an input time-series of a specific length that can be broken down into input patches. Each patch is processed into a vector by a residual block (as defined in the model definition), which maps it to the model dimension of the transformer layers. The vector is then added to positional encodings and fed into $n_l$ stacked transformer layers. SA refers to self-attention (note that we use multi-head causal attention) and FFN is the fully connected layer in the transformer. The output tokens are then mapped through a residual block to an output of size output_patch_len, which is the forecast for the time window following the last input patch seen by the model so far.
Figure 2: We report average performance in three groups of datasets. In all figures, lower is better and the error bars represent one standard error. Note that among the models compared only TimesFM and llmtime are zero-shot. In (a) we report results on the Monash datasets. Since the datasets have different scales, we take the Geometric Mean (GM) of the MAEs scaled by the MAE of a naive baseline. We can see that TimesFM is the top model. In (b), we report the similarly scaled MAE on the Darts benchmarks. TimesFM is within significance of the best performing methods, which are ARIMA and llmtime in this case. Note that these datasets have one time-series each and therefore statistical methods are competitive with deep learning ones. Finally, in (c) we report the average MAE for 96 and 192 horizon prediction tasks on 4 ETT datasets, i.e., 8 tasks in total. TimesFM and PatchTST are the best performing models in this case.
Figure 3: Ablation studies with respect to various design choices.
Figure 4: We report average performance in three groups of datasets. In all figures, lower is better and the error bars represent one standard error. Note that among the models compared only TimesFM and llmtime are zero-shot. In (a) we report results on the Monash datasets. Since the datasets have different scales, we take the Arithmetic Mean (AM) of the MAEs scaled by the MAE of a naive baseline. We can see that TimesFM is within significance of the top model, N-BEATS. In (b), we report the similarly scaled MAE on the Darts benchmarks. TimesFM is within significance of the top method, which is ARIMA in this case. Note that these datasets have one time-series each and therefore statistical methods are competitive with deep learning ones. Finally, in (c) we report the average MAE for 96 and 192 horizon prediction tasks on 4 ETT datasets, i.e., 8 tasks in total. TimesFM and PatchTST are the best performing models in this case.
Figure 5: Forecasts visualized on synthetic curves. The bottom row plots zoom in on the prediction horizon for the sake of clarity.
...and 4 more figures
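The aggregation described in the captions of Figures 2 and 4 can be sketched as follows: each dataset's MAE is divided by the MAE of a naive baseline on that dataset, and the resulting ratios are combined with either a geometric or an arithmetic mean. The function names and all numbers below are made up for illustration; they are not results from the paper, and the paper's exact naive baseline is not specified in this summary.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

def scaled_mae(model_mae, naive_mae):
    """Per-dataset ratio of the model's MAE to a naive baseline's MAE."""
    return np.asarray(model_mae, dtype=float) / np.asarray(naive_mae, dtype=float)

def geometric_mean(ratios):
    """Geometric mean, computed in log space; requires positive ratios."""
    return float(np.exp(np.mean(np.log(np.asarray(ratios, dtype=float)))))

def arithmetic_mean(ratios):
    """Plain average of the ratios."""
    return float(np.mean(np.asarray(ratios, dtype=float)))

# Hypothetical MAEs on three datasets with very different scales.
model_mae = [2.0, 50.0, 0.25]
naive_mae = [4.0, 100.0, 0.125]

ratios = scaled_mae(model_mae, naive_mae)  # [0.5, 0.5, 2.0]
print(geometric_mean(ratios))   # cube root of 0.5, roughly 0.794
print(arithmetic_mean(ratios))  # 1.0
```

In this toy case the two summaries disagree: the arithmetic mean is pulled up to 1.0 by the single dataset where the model is twice as bad as the naive baseline, while the geometric mean stays below 1. This is one reason the same model can rank differently under GM and AM aggregation.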
