MTS-Mixers: Multivariate Time Series Forecasting via Factorized Temporal and Channel Mixing

Zhe Li, Zhongwen Rao, Lujia Pan, Zenglin Xu

TL;DR

This work questions the necessity of attention in Transformer-based multivariate time series forecasting and introduces MTS-Mixers, a framework that factorizes temporal and channel interactions to exploit the low-rank structure of real-world data. By learning temporal and channel dependencies separately and focusing on the mapping from input history to future sequences, MTS-Mixers outperform existing Transformer-based models with higher efficiency on several real-world datasets. Ablation studies show that temporal and channel factorization, especially via a factorized MLP, yields significant gains and that attention is not essential for capturing temporal dependencies. The results highlight the practical value of decoupling temporal and channel processing and of modeling the input–output mapping directly.

Abstract

Multivariate time series forecasting has been widely used in various practical scenarios. Recently, Transformer-based models have shown significant potential in forecasting tasks because they can capture long-range dependencies. However, recent studies in the vision and NLP fields show that the role of attention modules is unclear and that they can be replaced by other token aggregation operations. This paper investigates the contributions and deficiencies of attention mechanisms with respect to time series forecasting performance. Specifically, we find that (1) attention is not necessary for capturing temporal dependencies, (2) entanglement and redundancy in the capture of temporal and channel interactions affect forecasting performance, and (3) it is important to model the mapping between the input and the prediction sequence. To this end, we propose MTS-Mixers, which use two factorized modules to capture temporal and channel dependencies. Experimental results on several real-world datasets show that MTS-Mixers outperform existing Transformer-based models with higher efficiency.
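
To make the idea of separate temporal and channel mixing concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a history of shape (n, c), mixes along the time axis with one small MLP, mixes along the channel axis with another, and then applies a linear map from the n input steps to the m output steps. The helper names (`mlp`, `init_mlp`, `mixer_forecast`), the hidden sizes, the random weights, and the residual connections are all illustrative assumptions; normalization and training are omitted.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, applied to the last axis of x."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def init_mlp(rng, dim, hidden):
    """Hypothetical helper: random weights for a dim -> hidden -> dim MLP."""
    return (rng.normal(0.0, 0.1, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.1, (hidden, dim)), np.zeros(dim))

def mixer_forecast(x, temporal_params, channel_params, w_out):
    """Map a history x of shape (n, c) to a forecast of shape (m, c)."""
    # Temporal mixing: transpose so time is the last axis, mix, transpose back.
    x = x + mlp(x.T, *temporal_params).T
    # Channel mixing: channels are already the last axis.
    x = x + mlp(x, *channel_params)
    # Linear input-to-output map along the time axis: n steps -> m steps.
    return (x.T @ w_out).T

rng = np.random.default_rng(0)
n, c, m = 96, 7, 24  # history length, channels, prediction length (arbitrary)
x = rng.normal(size=(n, c))
y = mixer_forecast(
    x,
    init_mlp(rng, n, 32),           # temporal MLP acts on the n time steps
    init_mlp(rng, c, 16),           # channel MLP acts on the c channels
    rng.normal(0.0, 0.1, (n, m)),   # projection from history to horizon
)
print(y.shape)  # (24, 7)
```

The point of the sketch is structural: no weight ever sees time and channel indices jointly, so the two kinds of dependency are learned by separate modules, and the prediction sequence is produced in one shot by the final projection.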

Paper Structure (21 sections, 9 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: The overall architecture of Transformer-based models for time series forecasting. Notice that the generation of the prediction sequence in the decoder is non-autoregressive.
  • Figure 2: Forecasting results on ETTh1 (Zhou et al., 2021) for the modifications of the Transformer and the variants of Fourier-Net under the 96-96 setting (both the historical horizon and the prediction length are 96). A higher R-squared score indicates better performance.
  • Figure 3: The redundancy of existing multivariate time series data. Top: exchange rate under different sampling rates. Bottom: electricity consumption of three consumers.
  • Figure 4: The overall architecture of MTS-Mixers. Left: the modules in the dashed box describe the general framework. Right: three specific implementations, where we can use attention, random matrix, or factorized MLP to capture dependencies.
  • Figure 5: The impact of the hyper-parameter $s\in\{1,2,3,4,6,8,12\}$ which corresponds to the number of interleaved subsequences after downsampling the original time series data under the 96-96 setting. A lower MSE means better performance.
  • ...and 5 more figures
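
The hyper-parameter s in the Figure 5 caption counts the interleaved subsequences obtained by downsampling. One plausible reading, sketched here as an assumption rather than as the paper's exact procedure, is that subsequence i collects every s-th time step starting at offset i. The function name `interleaved_subsequences` is hypothetical.

```python
def interleaved_subsequences(series, s):
    """Split a sequence into s interleaved subsequences.

    Subsequence i holds series[i], series[i + s], series[i + 2 * s], ...
    """
    if s < 1:
        raise ValueError("s must be a positive integer")
    return [series[i::s] for i in range(s)]

series = list(range(12))
parts = interleaved_subsequences(series, 3)
print(parts)  # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```

Every value of s listed in the caption divides 96, so under the 96-96 setting each subsequence would have the same length (for example, 24 steps when s = 4).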