Scalable Transformer for High Dimensional Multivariate Time Series Forecasting

Xin Zhou; Weiqing Wang; Wray Buntine; Shilin Qu; Abishek Sriramulu; Weicong Tan; Christoph Bergmeir

TL;DR

This work tackles high-dimensional multivariate time series forecasting, where traditional channel-dependent transformers struggle with noise and resource demands. The authors introduce STHD, a scalable Transformer framework that combines a sparsified relation matrix (via DeepGraph), a training strategy (ReIndex) to diversify batches and reduce memory, and a 2-D Transformer backbone capable of jointly modeling time and channel dependencies. Through analysis and extensive experiments on Crime-Chicago, Wiki-People, and Traffic, STHD achieves state-of-the-art performance, robustly handling large channel counts while mitigating noise from unrelated series. The approach offers a practical pathway toward accurate, scalable forecasting in real-world high-dimensional MTS settings, with public code and data available for reproduction.

Abstract

Deep models for Multivariate Time Series (MTS) forecasting have recently demonstrated significant success. Channel-dependent models capture complex dependencies that channel-independent models cannot. However, the number of channels in real-world applications outpaces the capabilities of existing channel-dependent models, and, contrary to common expectations, some of them underperform channel-independent models on high-dimensional data, which raises questions about the performance of channel-dependent models. To address this, our study first investigates why these channel-dependent models perform suboptimally on high-dimensional MTS data. Our analysis reveals two primary issues: noise introduced by unrelated series, which makes the crucial inter-channel dependencies harder to capture, and difficulties with training strategies caused by the high dimensionality of the data. To address these issues, we propose STHD, the Scalable Transformer for High-Dimensional Multivariate Time Series Forecasting. STHD has three components: a) Relation Matrix Sparsity, which limits the noise introduced and alleviates the memory issue; b) ReIndex, a training strategy that enables a more flexible batch size setting and increases the diversity of training data; and c) a Transformer that handles 2-D inputs and captures channel dependencies. These components jointly enable STHD to manage high-dimensional MTS while maintaining computational feasibility. Furthermore, experimental results show STHD's considerable improvement on three high-dimensional datasets: Crime-Chicago, Wiki-People, and Traffic. The source code and dataset are publicly available at https://github.com/xinzzzhou/ScalableTransformer4HighDimensionMTSF.git.
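
To make component (a) concrete, the following is a minimal sketch of relation matrix sparsification, in which each series keeps only its top $K$ most related series (as the Figure 3 caption describes). It is an illustration under stated assumptions, not the authors' implementation: absolute Pearson correlation is assumed as the relation measure, DeepGraph is not used, and the function name `top_k_related` is hypothetical.

```python
import numpy as np

def top_k_related(series: np.ndarray, k: int) -> np.ndarray:
    """Return, for each of M series, the indices of its k most related series.

    series: array of shape (M, T), one row per channel.
    Assumption: relatedness is measured by absolute Pearson correlation.
    Constant series would yield NaN correlations and are not handled here.
    """
    corr = np.abs(np.corrcoef(series))       # dense (M, M) relation matrix
    np.fill_diagonal(corr, -np.inf)          # a series is not its own auxiliary
    # Sort each row by descending correlation and keep the first k columns.
    return np.argsort(-corr, axis=1)[:, :k]  # sparse view: (M, k) indices
```

Storing only the `(M, k)` index array instead of the dense `(M, M)` matrix is what keeps memory bounded as the number of channels grows; for very large M the dense correlation step itself would also need to be computed in blocks.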

Paper Structure (35 sections, 5 equations, 8 figures, 5 tables)

Introduction
Related Work
Channel-independent Model
Channel-dependent Model
Initial Analysis
Scalable Transformer STHD
Problem Formulation
Overview of STHD
Relation Matrix Sparsity
ReIndex
2-D Transformer
Patches
Encoder
Attention Mechanism
Decoder
...and 20 more sections

Figures (8)

Figure 1: Results of using different proportions of training data on Crossformer and iTransformer. The x-axis shows the proportion of training data, from 0% to 100%. The y-axis in (a) and (c) represents RMSE on Crossformer and iTransformer (iTransfor for short), and the y-axis in (b) and (d) represents WRMSPE on Crossformer and iTransformer. Lines in yellow, black, grey, and blue represent horizons of 6, 12, 18, and 24 months.
Figure 2: Visualisation of the target series in red, and its related series/non-related series in gray and black.
Figure 3: Overview of STHD. Correlations among the Multivariate Time Series data are computed from different aspects and then sparsified: each series retains only its top $K$ auxiliary series. The target series and its $K$ auxiliary series are then fed into the 2-D Transformer backbone, which extracts a representation of the 2-D input.
Figure 4: ReIndex process before batch sampling. Each yellow square represents a series sample, consisting of one target series and its $K$ related series. All $M \times (1+K)$ series are split into $M \times (1+K) \times S$ subseries, each of length $L$. Existing works sample batches along the $S$ dimension. For example, the grey square in the middle of the figure denotes one such batch, which contains $b \times M \times (1+K)$ samples and therefore greatly increases memory usage. ReIndex reshapes the windows to $(M\times S) \times (1+K)$ and samples batches along the $M \times S$ dimension, so each batch contains $b\times (1+K)$ samples.
Figure 5: Effect of Related Series on Crime-Chicago. Yellow, blue and grey bars represent the target series with $K$ related series, with $K$ unrelated series, and without auxiliary series.
...and 3 more figures
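
The reshaping described in the Figure 4 caption can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: a window stride of 1 is assumed (the caption does not state the stride), the input is assumed to be already grouped as one target plus $K$ related series per row, and the function names `make_windows`, `reindex`, and `sample_batch` are hypothetical.

```python
import numpy as np

def make_windows(grouped: np.ndarray, length: int) -> np.ndarray:
    """Split grouped series of shape (M, 1+K, T) into windows (M, S, 1+K, L).

    Assumption: stride 1, so S = T - L + 1.
    """
    win = np.lib.stride_tricks.sliding_window_view(grouped, length, axis=-1)
    return win.transpose(0, 2, 1, 3)  # (M, 1+K, S, L) -> (M, S, 1+K, L)

def reindex(windows: np.ndarray) -> np.ndarray:
    """Merge the M and S axes so that batches are drawn over M * S."""
    m, s, g, l = windows.shape
    return windows.reshape(m * s, g, l)  # (M*S, 1+K, L)

def sample_batch(flat: np.ndarray, b: int, rng: np.random.Generator) -> np.ndarray:
    """Draw b samples; the batch holds b * (1+K) series, independent of M."""
    idx = rng.choice(flat.shape[0], size=b, replace=False)
    return flat[idx]
```

Because each drawn sample carries only its own $1+K$ series, the batch size $b$ can be chosen independently of the number of channels $M$, and a single batch can mix windows from different target series and different time positions.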
