
Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting

Xinghong Fu, Yanhong Li, Georgios Papaioannou, Yoon Kim

TL;DR

The paper tackles the high computational cost of time series foundation models (TSFMs) by proposing Reverso, a family of compact decoder-based TSFMs that achieve strong zero-shot forecasting with only $0.2$M to $2.6$M parameters. Reverso combines a hybrid sequence-mixing architecture (long convolutions and DeltaNet) with an attention-based decoder, a lightweight embedding, targeted data augmentation, Gaussian-process synthetic data, and FFT-based downsampling to expand the effective context. Empirical results show that Reverso attains competitive or superior performance on the GiftEval and LTSF benchmarks relative to much larger baselines, thereby pushing the efficiency-performance Pareto frontier. The work provides a practical recipe for compact TSFMs, with ablations highlighting the value of the hybrid architecture and the inference strategies, and it discusses limitations and future work in multivariate forecasting and uncertainty quantification.

Abstract

Learning time series foundation models has been shown to be a promising approach for zero-shot time series forecasting across diverse time series domains. Insofar as scaling has been a critical driver of performance of foundation models in other modalities such as language and vision, much recent work on time series foundation modeling has focused on scaling. This has resulted in time series foundation models with hundreds of millions of parameters that are, while performant, inefficient and expensive to use in practice. This paper describes a simple recipe for learning efficient foundation models for zero-shot time series forecasting that are orders of magnitude smaller. We show that large-scale transformers are not necessary: small hybrid models that interleave long convolution and linear RNN layers (in particular DeltaNet layers) can match the performance of larger transformer-based models while being more than a hundred times smaller. We also describe several data augmentation and inference strategies that further improve performance. This recipe results in Reverso, a family of efficient time series foundation models for zero-shot forecasting that significantly push the performance-efficiency Pareto frontier.
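The interleaving of long convolutions and DeltaNet layers described in the abstract can be illustrated with a minimal NumPy sketch. It is not Reverso's implementation: the function names `long_conv` and `delta_rule`, the random filter, and the toy sizes are assumptions made here for illustration, and the sketch omits learned projections, gating, normalization, the MLP channel mixers and the output head. The long convolution is evaluated with the FFT, and the DeltaNet update is written in its sequential delta-rule form.

```python
import numpy as np

def long_conv(x, kernel):
    """Causal long convolution along time, computed with the FFT.

    x: (L, d) sequence; kernel: (L, d), one filter per channel.
    """
    L = x.shape[0]
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    spec = np.fft.rfft(x, n=n, axis=0) * np.fft.rfft(kernel, n=n, axis=0)
    return np.fft.irfft(spec, n=n, axis=0)[:L]

def delta_rule(q, k, v, beta):
    """Sequential reference form of the delta-rule recurrence.

    S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T,  o_t = S_t q_t
    q, k: (L, d_k) with unit-norm keys; v: (L, d_v); beta: (L,).
    """
    L, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))          # fast-weight memory
    out = np.empty((L, d_v))
    for t in range(L):
        pred = S @ k[t]               # value currently stored under key k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])  # delta-rule correction
        out[t] = S @ q[t]
    return out

# Toy usage: one convolution layer followed by one delta-rule layer.
# Reusing h as keys, queries and values is a simplification; a real layer
# would use separate learned projections.
rng = np.random.default_rng(0)
L, d = 16, 4
x = rng.standard_normal((L, d))
h = long_conv(x, rng.standard_normal((L, d)) / L)
k = h / np.linalg.norm(h, axis=1, keepdims=True)   # unit-norm keys
y = delta_rule(q=k, k=k, v=h, beta=np.full(L, 0.5))
print(y.shape)  # (16, 4)
```

Both operations avoid the quadratic cost of softmax attention in the sequence length: the FFT convolution costs $O(L \log L)$ per channel, and the recurrence keeps a fixed-size state of shape $d_v \times d_k$.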

Paper Structure (36 sections, 13 equations, 7 figures, 12 tables, 5 algorithms)

This paper contains 36 sections, 13 equations, 7 figures, 12 tables, 5 algorithms.

Figures (7)

  • Figure 1: Zero-shot performance on the full Gift-Eval test set (Aksu et al., 2024). Reverso sets a new performance-efficiency Pareto frontier compared to existing time series foundation models.
  • Figure 2: Reverso architecture. An input sequence $t \in \mathbb{R}^{L}$ of length $L$ is first passed through a single projection layer to obtain embedding representations $x \in \mathbb{R}^{L \times d}$. Then, $n_{layers}$ sequence-mixing and channel-mixing blocks operate on $x$, where we alternate between long convolutions and DeltaNet for sequence mixing across length $L$, and use MLP layers for channel mixing across dimension $d$. The final output head (based on an attention-based transformation) produces the predictions $\hat{y} \in \mathbb{R}^p$.
  • Figure 3: Our data augmentation (left) and synthetic data generation (right) pipeline. For data augmentation, we apply a series of standard augmentations: downsampling, amplitude modulation, vertical flip, horizontal flip, censor, and mixup. For synthetic data generation, we generate data from Gaussian processes with kernels randomly selected from a kernel bank, and combine this with spike/trapezoidal patterns as well as processes sampled with trend, seasonality and irregularity.
  • Figure 4: LTSF performance vs. parameter count. MAE is averaged over the horizons $\{96, 192, 336, 720\}$ for the datasets ETTh1, ETTh2, ETTm1, ETTm2, Electricity and Weather. For models that are not evaluated on all the datasets (e.g. YingLong did not report results for Electricity), we impute the missing entry with the result of the best other existing model on that dataset.
  • Figure 5: Zero-shot performance on the Gift-Eval benchmark for (a) long sequences (average length at least 2048) and (b) short sequences.
  • ...and 2 more figures
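The augmentations and the Gaussian-process sampling named in the Figure 3 caption can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' pipeline: the two-entry `KERNEL_BANK`, all function names, every parameter range, and the specific readings of "amplitude modulation" (a linear envelope) and "censor" (clipping at a random quantile) are assumptions made here. The spike/trapezoidal patterns and the trend, seasonality and irregularity processes are omitted.

```python
import numpy as np

def rbf(t, length_scale=10.0):
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def periodic(t, period=24.0, length_scale=1.0):
    d = np.abs(t[:, None] - t[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

KERNEL_BANK = [rbf, periodic]  # hypothetical kernel bank

def sample_gp(length, rng):
    """Draw one series from a zero-mean GP with a randomly chosen kernel."""
    t = np.arange(length, dtype=float)
    kernel = KERNEL_BANK[rng.integers(len(KERNEL_BANK))]
    cov = kernel(t) + 1e-6 * np.eye(length)   # jitter for numerical stability
    return np.linalg.cholesky(cov) @ rng.standard_normal(length)

def downsample(x, factor):
    return x[::factor]                         # keep every factor-th point

def amplitude_modulate(x, rng):
    # Assumed form: multiply by a random linear envelope.
    envelope = np.linspace(rng.uniform(0.5, 1.5), rng.uniform(0.5, 1.5), len(x))
    return x * envelope

def vertical_flip(x):
    return -x                                  # negate values

def horizontal_flip(x):
    return x[::-1]                             # reverse time

def censor(x, rng):
    # Assumed form: clip values above a randomly chosen quantile.
    return np.minimum(x, np.quantile(x, rng.uniform(0.5, 1.0)))

def mixup(x, y, rng):
    lam = rng.uniform(0.0, 1.0)                # convex combination weight
    return lam * x + (1.0 - lam) * y

rng = np.random.default_rng(0)
a, b = sample_gp(128, rng), sample_gp(128, rng)
aug = mixup(censor(amplitude_modulate(vertical_flip(a), rng), rng),
            horizontal_flip(b), rng)
short = downsample(aug, 2)
print(aug.shape, short.shape)  # (128,) (64,)
```

Sampling through a Cholesky factor of the jittered covariance keeps the sketch dependent on NumPy alone; in practice the order and probability with which each augmentation is applied would be additional design choices.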