Unified Training of Universal Time Series Forecasting Transformers

Gerald Woo; Chenghao Liu; Akshat Kumar; Caiming Xiong; Silvio Savarese; Doyen Sahoo

TL;DR

The paper tackles universal time series forecasting by introducing Moirai, a masked encoder Transformer that handles cross-frequency data, an arbitrary number of variates, and flexible probabilistic outputs. Moirai is pre-trained on LOTSA, a large-scale open time series archive, yielding a single model capable of zero-shot forecasting across diverse datasets. Empirical results show competitive or superior zero-shot performance in both in-distribution and out-of-distribution settings, including probabilistic and long-horizon forecasts, and extensive ablations support the value of multiple patch sizes, any-variate attention, and a mixture distribution head. The work highlights the potential of unified training for Large Time Series Models (LTMs) and outlines practical considerations and future directions for scaling and multi-modality.

Abstract

Deep learning for time series forecasting has traditionally operated within a one-model-per-dataset framework, limiting its potential to leverage the game-changing impact of large pre-trained models. The concept of universal forecasting, emerging from pre-training on a vast collection of time series datasets, envisions a single Large Time Series Model capable of addressing diverse downstream forecasting tasks. However, constructing such a model poses unique challenges specific to time series data: i) cross-frequency learning, ii) accommodating an arbitrary number of variates for multivariate time series, and iii) addressing the varying distributional properties inherent in large-scale data. To address these challenges, we present novel enhancements to the conventional time series Transformer architecture, resulting in our proposed Masked Encoder-based Universal Time Series Forecasting Transformer (Moirai). Trained on our newly introduced Large-scale Open Time Series Archive (LOTSA) featuring over 27B observations across nine domains, Moirai achieves competitive or superior performance as a zero-shot forecaster when compared to full-shot models. Code, data, and model weights can be found at https://github.com/SalesforceAIResearch/uni2ts.

Paper Structure (61 sections, 10 equations, 9 figures, 23 tables)

Introduction
Related Work
Pre-training for Zero-shot Forecasting
Pre-train + Fine-tune for Time Series Forecasting
Method
Problem Formulation
Architecture
Multi Patch Size Projection Layers
Any-variate Attention
Mixture Distribution
Unified Training
LOTSA Data
Pre-training
Data Distribution
...and 46 more sections

Figures (9)

Figure 1: A universal forecaster is a large pre-trained model capable of handling any time series forecasting problem. It is trained on a large-scale time series dataset spanning multiple domains. Compared to the existing paradigm, universal forecasting faces the three key issues of $i)$ multiple frequencies, $ii)$ any-variate forecasting, and $iii)$ varying distributions.
Figure 2: Overall architecture of Moirai. Visualized is a 3-variate time series, where variates 0 and 1 are target variables (i.e., to be forecasted), and variate 2 is a dynamic covariate (values in the forecast horizon are known). Based on a patch size of 64, each variate is patchified into 3 tokens. The patch embeddings, along with sequence and variate ids, are fed into the Transformer. The shaded patches represent the forecast horizon, whose corresponding output representations are mapped into the mixture distribution parameters.
Figure 3: Aggregate results on the Monash Time Series Forecasting Benchmark. The normalized MAE is reported: the MAE on each dataset is divided by the naive forecast's MAE, and the resulting ratios are aggregated by taking the geometric mean across datasets.
Figure 4: Visualization of probabilistic forecasts by two variants of MoiraiSmall on the Traffic Hourly dataset. Both models forecast peaks; however, the Student's t-distribution is symmetric, giving inappropriate prediction intervals for a peak, as highlighted in red.
Figure 5: Plot of performance (MAE) against context length (x-axis in log scale) with prediction length 96 and patch size 32 on the validation set of the ETTm1, Electricity, and Weather datasets.
...and 4 more figures
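
The aggregation described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, and the naive forecast is assumed here to repeat the last observed value, which the caption does not specify.

```python
import math


def mae(actual, forecast):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def naive_forecast(last_observed, horizon):
    """Assumed naive baseline: repeat the last observed value."""
    return [last_observed] * horizon


def normalized_mae(actual, model_forecast, baseline_forecast):
    """MAE of the model divided by the MAE of the baseline on one dataset."""
    return mae(actual, model_forecast) / mae(actual, baseline_forecast)


def geometric_mean(ratios):
    """Geometric mean of strictly positive per-dataset ratios."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be non-empty and strictly positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Toy example with made-up numbers (not results from the paper).
actual = [10.0, 12.0, 14.0]
model = [11.0, 12.0, 13.0]
baseline = naive_forecast(last_observed=8.0, horizon=3)

ratio = normalized_mae(actual, model, baseline)  # (2/3) / 4
aggregate = geometric_mean([ratio, 0.5, 2.0])
```

A ratio below 1 means the model beats the naive baseline on that dataset. The geometric mean is a natural choice for averaging ratios because halving the error on one dataset and doubling it on another cancel out exactly.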
