DAM: Towards A Foundation Model for Time Series Forecasting

Luke Darlow; Qiwen Deng; Ahmed Hassan; Martin Asenov; Rajkarn Singh; Artjom Joosen; Adam Barker; Amos Storkey

TL;DR

The paper tackles universal forecasting for irregularly sampled time series by introducing the DAM, a transformer-based model that ingests randomly sampled histories and outputs a continuous time function $f(t)$ via a basis decomposition. It employs a history sampling regime with a long-tail distribution to access distant past without committing to fixed horizons, and represents forecasts as a weighted sum of basis functions with coefficients produced by the model: $f(t,\boldsymbol{\theta},\boldsymbol{\nu})$. DAM demonstrates strong cross-dataset generalization, achieving state-of-the-art or near SoTA performance on long- and very-long-term forecasting, and shows robust zero-shot transfer to held-out datasets, as well as effectiveness for imputation. The approach combines interpretability through basis-function decomposition and attention, and offers flexible inference-cost trade-offs, positioning DAM as a foundation model for universal time-series forecasting across diverse domains and resolutions.
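As a rough illustration of what "a weighted sum of basis functions" evaluated as a continuous function of time can look like, the following minimal sketch fits and evaluates such a function with NumPy. It is not the DAM itself: the sine/cosine basis, the fixed `periods` list, and the helper names `basis_matrix`, `fit_coefficients`, and `evaluate` are assumptions made for illustration, and an ordinary least-squares fit stands in for the coefficients that the model would produce.

```python
import numpy as np

def basis_matrix(t, periods):
    """Sine and cosine features per period; shape (len(t), 2 * len(periods)).

    The sinusoidal form and the choice of periods are illustrative assumptions.
    """
    t = np.asarray(t, dtype=float)[:, None]
    omega = 2.0 * np.pi / np.asarray(periods, dtype=float)[None, :]
    return np.concatenate([np.sin(omega * t), np.cos(omega * t)], axis=1)

def fit_coefficients(t, v, periods):
    """Least-squares coefficients: a stand-in for model-produced coefficients."""
    design = basis_matrix(t, periods)
    theta, *_ = np.linalg.lstsq(design, np.asarray(v, dtype=float), rcond=None)
    return theta

def evaluate(t, theta, periods):
    """Evaluate the weighted sum of basis functions at arbitrary times t."""
    return basis_matrix(t, periods) @ theta

# Irregularly spaced past observations (t <= 0) of a synthetic signal.
periods = [24.0, 168.0]  # hypothetical daily and weekly periods, in hours
rng = np.random.default_rng(0)
t_ctx = np.sort(rng.uniform(-500.0, 0.0, size=200))
v_ctx = (2.0 * np.sin(2.0 * np.pi * t_ctx / 24.0)
         + 0.5 * np.cos(2.0 * np.pi * t_ctx / 168.0))

theta = fit_coefficients(t_ctx, v_ctx, periods)

# Because the result is a function of continuous time, any horizon can be
# queried, including non-integer and far-future times.
t_query = np.array([1.0, 12.5, 300.0])
forecast = evaluate(t_query, theta, periods)
```

The same `evaluate` call works for past times (reconstruction) and future times (forecasting), which is the practical benefit of representing the output as a function of time rather than a fixed-length vector.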

Abstract

It is challenging to scale time series forecasting models such that they forecast accurately for multiple distinct domains and datasets, all with potentially different underlying collection procedures (e.g., sample resolution), patterns (e.g., periodicity), and prediction requirements (e.g., reconstruction vs. forecasting). We call this general task universal forecasting. Existing methods usually assume that input data is regularly sampled, and they forecast to pre-determined horizons, resulting in failure to generalise outside of the scope of their training. We propose the DAM, a neural model that takes randomly sampled histories and outputs an adjustable basis composition as a continuous function of time for forecasting to non-fixed horizons. It involves three key components: (1) a flexible approach for using randomly sampled histories from a long-tail distribution, which enables an efficient global perspective of the underlying temporal dynamics while retaining focus on the recent history; (2) a transformer backbone that is trained on these actively sampled histories to produce, as representational output, (3) the basis coefficients of a continuous function of time. We show that a single univariate DAM, trained on 25 time series datasets, either outperformed or closely matched existing SoTA models at multivariate long-term forecasting across 18 datasets, including 8 held-out for zero-shot transfer, even though these models were trained to specialise for each dataset-horizon combination. This single DAM excels at zero-shot transfer and very-long-term forecasting, performs well at imputation, is interpretable via basis function composition and attention, can be tuned for different inference-cost requirements, and is robust to missing and irregularly sampled data by design.
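To make the idea of sampling histories from a long-tail distribution concrete, here is a minimal sketch. The weighting function, the `scale` parameter, sampling without replacement, and the helper name `sample_history` are all assumptions chosen for illustration; the paper defines its own distribution, which is not reproduced here.

```python
import numpy as np

def sample_history(times, values, n_samples, scale, rng):
    """Draw context points whose sampling probability decays slowly with
    distance from 'now' (t = 0).

    The inverse-square weighting below is an assumed heavy-tailed form, not
    the distribution used in the paper.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    weights = 1.0 / (1.0 + np.abs(times) / scale) ** 2
    probs = weights / weights.sum()
    # Sampling without replacement is an assumption of this sketch.
    idx = rng.choice(len(times), size=n_samples, replace=False, p=probs)
    idx.sort()  # restore chronological order
    return times[idx], values[idx]

rng = np.random.default_rng(0)
times = np.arange(-5000, 0, dtype=float)      # 5000 past steps, t < 0
values = np.sin(2.0 * np.pi * times / 24.0)   # synthetic series
t_ctx, v_ctx = sample_history(times, values, n_samples=256, scale=200.0, rng=rng)
```

With these settings most of the 256 context points fall close to t = 0, while a minority reach far into the past, so a fixed-size context still carries some information about long-range structure. Because the sampled times are explicit inputs, gaps or irregular spacing in the raw series need no special handling.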

Paper Structure (55 sections, 2 equations, 22 figures, 9 tables)


Introduction
Related Work
Multi-scale modelling.
Frequency-domain modelling.
The DAM, explained
Backbone
Model structure.
History sampling regime: a new treatment of time
Forecasting mechanism: basis function composition
Basis function initialisation.
Training
Inference process
HSR tuning.
Experiments
Long-term time series forecasting
...and 40 more sections

Figures (22)

Figure 1: (1) Context time-value samples from the HSR (Section \ref{sec:DAM-time}) are sent to (2) a linear solver to initialise the basis coefficients, $\boldsymbol{\theta}_0$. These are (3) embedded into (4) B-tokens. Context data is also (5) embedded into (6) TV-tokens and processed through (7) four layers of MHSA, ToME, and feed-forward blocks, with layer-norm, and used as (8) keys and values for cross attention, where the queries are (9) the B-tokens. Both TV- and B-tokens are (10) passed to subsequent layers. The (11) B-tokens from the final layer are projected into (12) basis coefficients for (13) forecasting and backcasting.
Figure 2: The HSR employed by the DAM, with the distribution in Equation \ref{eq:hsr} shown in yellow. Regularly sampled context and targets of the same size as those from the HSR are shown to demonstrate how the HSR enables a more global perspective while retaining focus close to 'now' ($t=0$).
Figure 3: Basis function initialisation versus the DAM, showing past fit and future extrapolation.
Figure 4: Very-long-term forecasting. The 'OT' variable of Weather is shown on the left and MSE versus horizon in the centre. Compared to baselines, the DAM produces better-performing forecasts that also contain interesting multi-scale patterns. To produce these figures the DAM context was set to 512, matching PatchTST. The inset shows from -512 to 720 steps. MSE vs. very-long horizon on 9 datasets is given in the table. Horizons were set to 3/4 the length of the validation set.
Figure 5: Attention analysis showing HSR samples, backcast, forecast, past and future data (ETTh1 test), and cumulative attentions per TV-token. The degree of attention paid is colour-coded and normalised for each attention head. Basis coefficients per period are also shown.
...and 17 more figures
