S2TX: Cross-Attention Multi-Scale State-Space Transformer for Time Series Forecasting

Zihao Wu, Juncheng Dong, Haoming Yang, Vahid Tarokh

TL;DR

This work tackles multivariate time-series forecasting by addressing cross-variate correlations and global-local interactions that are often modeled separately. It introduces S2TX, a cross-attention-based State-Space Transformer that uses a global Mamba module to extract long-range cross-variate context from coarse patches and a local patch-based Transformer to model short-range, variate-local patterns. A cross-attention mechanism fuses these contexts, enabling variate-level interactions and efficient global-local communication. Empirical results on seven benchmark datasets across multiple horizons demonstrate state-of-the-art performance with a low memory footprint and robustness to missing data.

Abstract

Time series forecasting has recently achieved significant progress with multi-scale models to address the heterogeneity between long- and short-range patterns. Despite their state-of-the-art performance, we identify two potential areas for improvement. First, the variates of the multivariate time series are processed independently. Second, the multi-scale (long- and short-range) representations are learned separately by two independent models without communication. In light of these concerns, we propose State Space Transformer with cross-attention (S2TX). S2TX employs a cross-attention mechanism to integrate a Mamba model for extracting long-range cross-variate context and a Transformer model with local window attention to capture short-range representations. By cross-attending to the global context, the Transformer model further facilitates variate-level interactions as well as local/global communication. Comprehensive experiments on seven classic long- and short-range time-series forecasting benchmark datasets demonstrate that S2TX can achieve highly robust SOTA results while maintaining a low memory footprint.
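
To make the fusion step concrete, the following is a minimal NumPy sketch of single-head cross-attention in which local tokens act as queries and a global context supplies the keys and values. It is an illustration under stated assumptions, not the authors' implementation: the function name `cross_attention`, the dimensions, and the random projection matrices are hypothetical, and the `global_context` array is a random stand-in for what the Mamba module would produce.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(local_tokens, global_context, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper).

    Queries come from the local tokens; keys and values come from
    the global context, so each local token reads from the global view.
    """
    q = local_tokens @ w_q        # (n_local, d_model)
    k = global_context @ w_k      # (n_global, d_model)
    v = global_context @ w_v      # (n_global, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_local, n_global)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 16                                       # assumed embedding size
local_tokens = rng.normal(size=(12, d_model))      # 12 fine-grained patch embeddings
global_context = rng.normal(size=(4, d_model))     # stand-in for the global model's output
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

fused, weights = cross_attention(local_tokens, global_context, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Because the attention weights have shape (local tokens, global tokens), the cost of this step grows with the product of the two sequence lengths rather than with the square of the fine-grained length.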

Paper Structure

This paper contains 20 sections, 8 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the performance of different architectures over 7 different benchmark datasets. Average results (MSE) are reported.
  • Figure 2: A snippet of the weather dataset. Two variables (blue and green) were plotted over 720 time steps. The purple boxed region indicates where a global-local interaction exists, and the red boxed region indicates a cross-variate correlation.
  • Figure 3: Overview of the proposed architecture S2TX. Different variables (in different colors) of the time series are patched into global and local patches. The global patches are processed by the global model, which outputs the global context that is used to compute the key and value matrices during cross-attention with the local model. Skip connections and normalization layers are omitted for clarity of presentation.
  • Figure 4: Patching transforms a one-dimensional sequence into a sequence of patches.
  • Figure 5: Empirical time series versus predicted time series across different architectures. S2TX can better capture the variation of the variable over time.
  • ...and 2 more figures
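
The patching operation named in the Figure 4 caption can be sketched as follows. This is a minimal illustration, assuming non-overlapping patches and made-up patch lengths; the helper name `patch` and the specific lengths and strides are hypothetical and are not taken from the paper.

```python
import numpy as np

def patch(series, patch_len, stride):
    """Split a 1-D sequence into patches of length patch_len (hypothetical helper).

    Trailing values that do not fill a complete patch are dropped.
    """
    series = np.asarray(series)
    if series.ndim != 1 or len(series) < patch_len:
        raise ValueError("series must be 1-D and at least patch_len long")
    n_patches = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len]
                     for i in range(n_patches)])

series = np.arange(16)

# Fine patches for a local model, coarse patches for a global model
# (both lengths are assumptions chosen for illustration).
local_patches = patch(series, patch_len=4, stride=4)
global_patches = patch(series, patch_len=8, stride=8)

print(local_patches.shape)   # (4, 4)
print(global_patches.shape)  # (2, 8)
```

With a stride smaller than the patch length the same helper yields overlapping patches, which trades a longer patch sequence for smoother coverage of the input.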