Long-term Forecasting with TiDE: Time-series Dense Encoder

Abhimanyu Das; Weihao Kong; Andrew Leach; Shaan Mathur; Rajat Sen; Rose Yu

TL;DR

TiDE addresses long-term multivariate forecasting with a simple, fast MLP-based encoder-decoder that incorporates covariates. A theoretical result shows that a linear analogue of the model can achieve near-optimal error for linear dynamical systems, and empirical results show that TiDE matches or exceeds state-of-the-art neural baselines while running substantially faster. The temporal decoder and the covariate pathways are shown to contribute to this performance, and ablations highlight efficiency and robustness advantages over Transformer-based approaches. Overall, TiDE challenges the necessity of self-attention for these tasks by delivering competitive accuracy at a much lower computational cost.

Abstract

Recent work has shown that simple linear models can outperform several Transformer-based approaches in long-term time-series forecasting. Motivated by this, we propose a Multi-layer Perceptron (MLP) based encoder-decoder model, Time-series Dense Encoder (TiDE), for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies. Theoretically, we prove that the simplest linear analogue of our model can achieve a near-optimal error rate for linear dynamical systems (LDS) under some assumptions. Empirically, we show that our method can match or outperform prior approaches on popular long-term time-series forecasting benchmarks while being 5-10x faster than the best Transformer-based model.

Paper Structure (26 sections, 3 theorems, 20 equations, 6 figures, 8 tables)

Introduction
Background and Related Work
Problem Setting
Notation
Multivariate Forecasting
Model
Encoding
Decoding
Experimental Results
Long-Term Time-Series Forecasting
Demand Forecasting
Training and Inference Efficiency
Ablation Study
Temporal Decoder.
Context Size.
...and 11 more sections

Key Result

Proposition 1

Choose any $\varepsilon > 0$. Let $S = \{ (X_i, Y_i) \}_{i=1}^N$ be a set of i.i.d. training samples from a distribution $\mathcal{D}$. Let $\hat{h} \vcentcolon= \operatorname*{argmin}_{h \in \hat{\mathcal{H}}} \ell_S(h)$ with a choice of $k = \Theta(\log(1/\varepsilon))$. Let $h^* \vcentcolon= \ldots$

Figures (6)

Figure 1: Overview of TiDE architecture. The dynamic covariates per time-point are mapped to a lower dimensional space using a feature projection step. Then the encoder combines the look-back along with the projected covariates with the static attributes to form an encoding. The decoder maps this encoding to a vector per time-step in the horizon. Then a temporal decoder combines this vector (per time-step) with the projected features of that time-step in the horizon to form the final predictions. We also add a global linear residual connection from the look-back to the horizon.
Figure 2: In (a) we show the inference time per batch on the electricity dataset. In (b) we show the corresponding training times for one epoch. In both the figures the y-axis is plotted in log-scale. Note that the PatchTST model ran out of GPU memory for look-back $L \geq 1440$.
Figure 3: We plot the actuals vs the predictions from TiDE with and without the temporal decoder after just one epoch of training on the modified electricity dataset. The red part of the horizontal line indicates an event of Type A occurring.
Figure 4: We plot the Test MSE on the traffic dataset as a function of different context sizes for three different horizon length tasks. Each plot is an average of 5 runs with the 2 standard error interval plotted.
Figure 5: We perform an ablation study by presenting results from our model without any residual connections, on the electricity benchmark. We average over 5 runs for all the numbers and present the corresponding standard errors.
...and 1 more figure
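
The data flow described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration for a single series, not the authors' implementation: the block internals (`res_block` as a one-hidden-layer ReLU MLP with a linear skip), the helper names, all layer sizes, and the random untrained weights are assumptions made here for illustration only. Only the wiring (feature projection, encoder, decoder, temporal decoder, global linear residual) follows the caption.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out):
    """Random weights for one block (hypothetical helper, untrained)."""
    return {
        "w1": rng.normal(0, 0.1, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0, 0.1, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
        "skip": rng.normal(0, 0.1, (d_in, d_out)),
    }

def res_block(p, x):
    """Assumed block: one-hidden-layer ReLU MLP plus a linear skip."""
    h = np.maximum(x @ p["w1"] + p["b1"], 0.0)
    return h @ p["w2"] + p["b2"] + x @ p["skip"]

def init_tide(rng, L, H, r, s, r_tilde=2, d_enc=16, p=4, hidden=32):
    """All sizes here are illustrative assumptions."""
    return {
        "proj": init_mlp(rng, r, hidden, r_tilde),
        "enc": init_mlp(rng, L + (L + H) * r_tilde + s, hidden, d_enc),
        "dec": init_mlp(rng, d_enc, hidden, H * p),
        "temporal": init_mlp(rng, p + r_tilde, hidden, 1),
        "w_res": rng.normal(0, 0.1, (L, H)),
    }

def tide_forward(params, y_past, x_dyn, a_static, L, H):
    """y_past: (L,), x_dyn: (L+H, r), a_static: (s,). Returns (H,)."""
    # Feature projection: map covariates at each time-point to a lower dimension.
    x_proj = res_block(params["proj"], x_dyn)            # (L+H, r_tilde)
    # Encoder: look-back + projected covariates + static attributes.
    enc_in = np.concatenate([y_past, x_proj.ravel(), a_static])
    e = res_block(params["enc"], enc_in)                 # (d_enc,)
    # Decoder: one vector per time-step in the horizon.
    g = res_block(params["dec"], e).reshape(H, -1)       # (H, p)
    # Temporal decoder: combine each vector with that step's projected features.
    td_in = np.concatenate([g, x_proj[L:]], axis=1)      # (H, p + r_tilde)
    y_td = res_block(params["temporal"], td_in)[:, 0]    # (H,)
    # Global linear residual from the look-back to the horizon.
    return y_td + y_past @ params["w_res"]

rng = np.random.default_rng(0)
L, H, r, s = 24, 8, 3, 2
params = init_tide(rng, L, H, r, s)
y_hat = tide_forward(params, rng.normal(size=L),
                     rng.normal(size=(L + H, r)), rng.normal(size=s), L, H)
print(y_hat.shape)  # (8,)
```

Because the temporal decoder reads the projected covariates of each horizon step directly, a future covariate can influence its own time-step without passing through the encoding bottleneck, which is the role the caption assigns to it.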

Theorems & Definitions (5)

Definition 1
Definition 2: LDS predictor
Proposition 1: Generalization bound of learning LDS with auto-regressive algorithm
Proposition 2: Approximating LDS with auto-regressive model
Lemma 1: Generalization via Rademacher complexity
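
To give some intuition for why a window length of $k = \Theta(\log(1/\varepsilon))$ can suffice, the sketch below fits a finite-window linear predictor to a stable, scalar-state LDS. Everything here is an assumption made for illustration and is not the paper's construction or proof: the system $h_{t+1} = a h_t + x_t$, $y_t = h_t$ with $|a| < 1$, the choice to regress on the last $k$ inputs only, and the function names. In this toy system, dropping inputs older than $k$ steps leaves an error that scales with $a^k$, so the error shrinks geometrically as the window grows.

```python
import numpy as np

def simulate_lds(a, T, rng):
    """Toy stable LDS (assumed form): h[t] = a*h[t-1] + x[t-1], y[t] = h[t]."""
    x = rng.normal(size=T)
    h = np.zeros(T)
    for t in range(1, T):
        h[t] = a * h[t - 1] + x[t - 1]
    return x, h

def fit_window_predictor(x, y, k, start):
    """Least-squares fit of y[t] on x[t-1], ..., x[t-k] for t >= start (start >= k)."""
    T = len(y)
    feats = np.stack([x[start - i : T - i] for i in range(1, k + 1)], axis=1)
    target = y[start:]
    w, *_ = np.linalg.lstsq(feats, target, rcond=None)
    mse = float(np.mean((feats @ w - target) ** 2))
    return w, mse

rng = np.random.default_rng(0)
x, y = simulate_lds(a=0.5, T=2000, rng=rng)
for k in (2, 4, 8):
    # Same target rows for every k, so the errors are directly comparable.
    _, mse = fit_window_predictor(x, y, k, start=16)
    print(k, mse)
```

The fitted weights should approach the impulse response $1, a, a^2, \dots$ of the toy system, and the printed error should fall sharply as $k$ increases.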
