Transformers and Their Roles as Time Series Foundation Models

Dennis Wu, Yihan He, Yuan Cao, Jianqing Fan, Han Liu

TL;DR

The paper shows that there exist transformers that fit an autoregressive model to an input univariate time series via gradient descent, and it analyzes MOIRAI, a multivariate time series foundation model capable of handling an arbitrary number of covariates.

Abstract

We give a comprehensive analysis of transformers as time series foundation models, focusing on their approximation and generalization capabilities. First, we demonstrate that there exist transformers that fit an autoregressive model on input univariate time series via gradient descent. We then analyze MOIRAI, a multivariate time series foundation model capable of handling an arbitrary number of covariates. We prove that it is capable of automatically fitting autoregressive models with an arbitrary number of covariates, offering insights into its design and empirical success. For generalization, we establish bounds for pretraining when the data satisfies Dobrushin's condition. Experiments support our theoretical findings, highlighting the efficacy of transformers as time series foundation models.
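To make the phrase "fit an autoregressive model via gradient descent" concrete, the sketch below runs plain gradient descent on the least-squares loss of an AR($q$) model in NumPy. It illustrates only the target computation, not the paper's transformer construction. The function names (`make_ar_series`, `lagged_design`, `fit_ar_gd`), the coefficients, the step size rule, and the number of steps are all illustrative assumptions.

```python
import numpy as np

def make_ar_series(coeffs, length, noise_std=1.0, seed=0):
    """Simulate x_t = sum_i coeffs[i] * x_{t-1-i} + noise (hypothetical helper)."""
    coeffs = np.asarray(coeffs, dtype=float)
    rng = np.random.default_rng(seed)
    q = len(coeffs)
    x = np.zeros(length + q)  # the first q entries are zero initial values
    for t in range(q, length + q):
        # x[t-q:t][::-1] is (x[t-1], ..., x[t-q])
        x[t] = coeffs @ x[t - q:t][::-1] + noise_std * rng.normal()
    return x[q:]

def lagged_design(x, q):
    """Build (X, y) where the row for target x[t] holds (x[t-1], ..., x[t-q])."""
    X = np.stack([x[q - 1 - i: len(x) - 1 - i] for i in range(q)], axis=1)
    y = x[q:]
    return X, y

def fit_ar_gd(x, q, steps=500):
    """Gradient descent on the mean squared one-step prediction error."""
    X, y = lagged_design(x, q)
    n = len(y)
    # Assumed step size: 1 / (largest eigenvalue of the Hessian X^T X / n)
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n)[-1]
    w = np.zeros(q)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad
    return w

true_coeffs = np.array([0.5, -0.3])  # assumed stationary AR(2) coefficients
series = make_ar_series(true_coeffs, length=2000)
w_gd = fit_ar_gd(series, q=2)
print("gradient descent estimate:", w_gd)
```

Because the loss is a convex quadratic, the iterates approach the ordinary least-squares solution; the transformers discussed in the abstract are shown to carry out such a fit on the input series.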

Paper Structure

This paper contains 54 sections, 24 theorems, 129 equations, 2 figures, 3 tables.

Key Result

Lemma 3.2

Given a sequence of tokens $\boldsymbol{H}$ in the form of Equation eqn:input-data, there exists a one-layer, $q_{\max}$-head attention layer such that, for any $q \leq q_{\max}$, the columns of $\text{Attn}_{\boldsymbol{\theta}}^{\dagger}( \boldsymbol{H} )$ have the following form:

Figures (2)

  • Figure 1: Top: Model performance on data with different numbers of covariates. For both MOIRAI and MOIRAI-relu, performance behaves like that of least squares. As in our construction, the longer the lookback size, the more examples are available for transformers to fit an $\mathtt{AR}$ model. Note that our test data has variance $\sigma^2 = 1$, so the MSE of both models is expected to converge to $1$ as the lookback size increases. Bottom: Generalization to unseen values of $d, q$. From left to right: MOIRAI's generalization performance (pretrained on $d\in\{4,5\}, q\in\{4,5\}$) on high-dimensional data ($d=10$), low-dimensional data ($d=2$), and high-lag, low-dimensional data ($d=3,q=7$). Here "high" and "low" are relative to the pretraining data. Even though MOIRAI was not pretrained on any time series with $d=10$, it still generalizes well and shows even better sample complexity than least squares regression. Finally, even when both $q$ and $d$ are unseen, MOIRAI's ability to make accurate predictions is not affected.
  • Figure 2: When least squares regression fails to attain the optimal prediction error, the transformers' MSE still converges towards $1$ as the lookback size increases. This indicates that these models can fit models more complex than linear regression on a given time series.
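The least-squares baseline behavior described in the captions (MSE approaching the noise variance as the lookback size grows) can be reproduced in a simplified setting. The sketch below is univariate, with no covariates, and does not reproduce the paper's experimental setup; the helper names, AR coefficients, lookback sizes, and trial counts are illustrative assumptions.

```python
import numpy as np

def simulate_ar_batch(coeffs, n_series, length, noise_std, rng, burn_in=50):
    """Simulate n_series independent AR(q) series at once (hypothetical helper)."""
    coeffs = np.asarray(coeffs, dtype=float)
    q = len(coeffs)
    total = burn_in + length
    x = np.zeros((n_series, total + q))
    noise = noise_std * rng.normal(size=(n_series, total + q))
    for t in range(q, total + q):
        # Reverse the last q values so coeffs[i] multiplies x[:, t-1-i]
        x[:, t] = x[:, t - q:t][:, ::-1] @ coeffs + noise[:, t]
    return x[:, -length:]  # drop the burn-in and the zero initial values

def next_step_mse(coeffs, lookback, n_series=2000, noise_std=1.0, seed=0):
    """MSE of least-squares AR(q) one-step forecasts fitted on `lookback` points."""
    rng = np.random.default_rng(seed)
    q = len(coeffs)
    series = simulate_ar_batch(coeffs, n_series, lookback + 1, noise_std, rng)
    sq_errors = np.empty(n_series)
    for k, x in enumerate(series):
        context, target = x[:lookback], x[lookback]
        # Row for target context[t] holds (context[t-1], ..., context[t-q])
        X = np.stack(
            [context[q - 1 - i: lookback - 1 - i] for i in range(q)], axis=1
        )
        y = context[q:]
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Forecast from the q most recent values, newest first
        pred = context[::-1][:q] @ w
        sq_errors[k] = (pred - target) ** 2
    return sq_errors.mean()

coeffs = [0.5, -0.3]  # assumed stationary AR(2) coefficients
for lookback in (8, 32, 200):
    print(lookback, next_step_mse(coeffs, lookback))
```

With unit noise variance, the forecast error cannot fall below $1$ on average, so the printed MSE should shrink towards that floor as the lookback size grows and the coefficient estimates improve.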

Theorems & Definitions (42)

  • Definition 2.1: Attention layer
  • Definition 2.2: Any-variate Attention.
  • Remark 2.3
  • Definition 2.4: MLP Layer
  • Definition 2.5: MOIRAI Transformer
  • Remark 2.6
  • Remark 3.1
  • Lemma 3.2
  • Proposition 3.4: Uni-variate Autoregressive Regression via Transformers
  • Lemma 3.5
  • ...and 32 more