sTransformer: A Modular Approach for Extracting Inter-Sequential and Temporal Information for Time-Series Forecasting

Jiaheng Yin, Zhengxin Shi, Jianshen Zhang, Xiaomin Lin, Yulin Huang, Yongzhi Qi, Wei Qi

TL;DR

The paper addresses two problems: Transformer-based time-series forecasters often underperform simple linear models on long horizons, and they lack a scalable way to model inter-sequence information. It introduces sTransformer, which adds a Sequence and Temporal Convolutional Network (STCN) and a Sequence-guided Mask Attention (SeqMask) to the Transformer block to capture temporal and inter-sequence information as well as global feature interactions. Across five public multivariate datasets for long-term forecasting, sTransformer achieves state-of-the-art results, outperforming linear predictors and prior SOTA methods, and it also performs strongly on short-term forecasting and anomaly detection. The results support the modular, scalable design and suggest that the model can serve as a baseline across time-series tasks.

Abstract

In recent years, numerous Transformer-based models have been applied to long-term time-series forecasting (LTSF) tasks. However, recent studies with linear models have questioned their effectiveness, demonstrating that simple linear layers can outperform sophisticated Transformer-based models. In this work, we review and categorize existing Transformer-based models into two main types: (1) modifications to the model structure and (2) modifications to the input data. The former offers scalability but falls short in capturing inter-sequential information, while the latter preprocesses time-series data but is challenging to use as a scalable module. We propose sTransformer, which introduces the Sequence and Temporal Convolutional Network (STCN) to fully capture both sequential and temporal information. Additionally, we introduce a Sequence-guided Mask Attention mechanism to capture global feature information. Our approach ensures the capture of inter-sequential information while maintaining module scalability. We compare our model with linear models and existing forecasting models on long-term time-series forecasting, achieving new state-of-the-art results. We also conducted experiments on other time-series tasks, achieving strong performance. These results demonstrate that Transformer-based structures remain effective and that our model can serve as a viable baseline for time-series tasks.

Paper Structure

This paper contains 27 sections, 12 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: sTransformer block overview. STCN and SeqMask are introduced into the traditional Transformer structure. STCN extracts information along both the sequence and temporal dimensions. SeqMask lets features of the Value layer interact with global features, enhancing global representation capability.
  • Figure 2: STCN. The left part is the TCN structure, and the right part is the SCN structure. TCN performs convolution along the temporal dimension, receiving information from previous time steps at each position of each dilation layer. SCN performs convolution along the sequence dimension, using padding through concatenation. In TCN, the layers employ different dilation values, while in SCN, the layers use varying convolution kernel sizes. Each TCN layer integrates two sets of convolutional blocks, and each SCN layer integrates three. Notably, because time is ordered, the convolutions in TCN are causal.
  • Figure 3: Sequence-Guided Mask Attention. This structure extracts contextual features from the embedding inputs ($x_{1,:},x_{2,:},\dots, x_{M,:}$). These features are multiplied by the information directly obtained from the original features through a Sequence-Guided Mask (SG_Mask) to produce interaction information. The final representation $V_n$, containing global interaction information, is obtained through iterations of $n$ blocks.
  • Figure 4: Parameter sensitivity. The figure shows the prediction performance of our model with different parameter values on four datasets. The parameters include lookback length, learning rate, embedding size, and block number.
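
The two convolutions described in the Figure 2 caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, each function handles a single channel, and the caption's "padding through concatenation" is read here as wrapping around to the first series, which is an assumption.

```python
from typing import List, Sequence


def causal_dilated_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Causal convolution along time for one series.

    output[t] depends only on x[t], x[t - d], x[t - 2d], ... (d = dilation).
    Positions before the start of the series count as zero (left zero-padding).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation  # look back k * dilation steps, never forward
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def sequence_conv_wraparound(
    column: Sequence[float], kernel: Sequence[float]
) -> List[float]:
    """Convolution across the series axis at one fixed time step.

    Assumption: padding by concatenation is modeled as appending a copy of
    the leading series after the last one, implemented with modulo indexing,
    so the output keeps exactly one value per series.
    """
    m_total = len(column)
    return [
        sum(w * column[(m + k) % m_total] for k, w in enumerate(kernel))
        for m in range(m_total)
    ]


# One series over four time steps, kernel size 2, dilation 2.
temporal_out = causal_dilated_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2)

# Three series observed at one time step, kernel size 2.
sequence_out = sequence_conv_wraparound([1.0, 2.0, 3.0], [1.0, 1.0])
```

Stacking the temporal function with growing dilation values widens the receptive field without looking ahead, while calling the sequence function with different kernel sizes mirrors the caption's note that SCN layers vary kernel size rather than dilation.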