Scalable Transformer for High Dimensional Multivariate Time Series Forecasting
Xin Zhou, Weiqing Wang, Wray Buntine, Shilin Qu, Abishek Sriramulu, Weicong Tan, Christoph Bergmeir
TL;DR
This work tackles high-dimensional multivariate time series forecasting, where traditional channel-dependent transformers struggle with noise and resource demands. The authors introduce STHD, a scalable Transformer framework that combines a sparsified relation matrix (via DeepGraph), a training strategy (ReIndex) to diversify batches and reduce memory, and a 2-D Transformer backbone capable of jointly modeling time and channel dependencies. Through analysis and extensive experiments on Crime-Chicago, Wiki-People, and Traffic, STHD achieves state-of-the-art performance, robustly handling large channel counts while mitigating noise from unrelated series. The approach offers a practical pathway toward accurate, scalable forecasting in real-world high-dimensional MTS settings, with public code and data available for reproduction.
Abstract
Deep models for Multivariate Time Series (MTS) forecasting have recently demonstrated significant success. Channel-dependent models capture complex inter-channel dependencies that channel-independent models cannot. However, the number of channels in real-world applications outpaces the capabilities of existing channel-dependent models, and, contrary to common expectations, some of these models underperform channel-independent models on high-dimensional data, which raises questions about their effectiveness. To address this, our study first investigates why channel-dependent models perform suboptimally on high-dimensional MTS data. Our analysis reveals two primary issues: noise introduced by unrelated series, which makes the crucial inter-channel dependencies harder to capture, and training-strategy challenges that arise from the high dimensionality of the data. To address these issues, we propose STHD, the Scalable Transformer for High-Dimensional Multivariate Time Series Forecasting. STHD has three components: a) Relation Matrix Sparsity, which limits the noise introduced and alleviates the memory issue; b) ReIndex, a training strategy that enables more flexible batch-size settings and increases the diversity of training data; and c) a Transformer that handles 2-D inputs and captures channel dependencies. Together, these components enable STHD to handle high-dimensional MTS while remaining computationally feasible. Experimental results show that STHD yields considerable improvements on three high-dimensional datasets: Crime-Chicago, Wiki-People, and Traffic. The source code and dataset are publicly available at https://github.com/xinzzzhou/ScalableTransformer4HighDimensionMTSF.git.
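The idea behind Relation Matrix Sparsity can be illustrated with a minimal sketch. The abstract does not specify how relatedness is scored or how the matrix is pruned, so the code below rests on assumptions: relatedness is taken to be absolute Pearson correlation, and each channel keeps only its k strongest neighbours. The function name `sparsify_relations` and the parameter `k` are hypothetical and do not come from the released code.

```python
import numpy as np


def sparsify_relations(series: np.ndarray, k: int) -> np.ndarray:
    """Return, for each channel, the indices of its k most related channels.

    series: array of shape (channels, timesteps).
    Assumption: relatedness is absolute Pearson correlation; the paper's
    actual criterion may differ.
    """
    corr = np.corrcoef(series)               # dense (C, C) relation matrix
    corr = np.nan_to_num(corr, nan=0.0)      # constant channels yield NaN
    strength = np.abs(corr)
    np.fill_diagonal(strength, -np.inf)      # never pick a channel as its own neighbour
    # Keep only the k strongest relations per row: (C, C) -> (C, k).
    return np.argsort(-strength, axis=1)[:, :k]


# Toy data: channels 1 and 3 are noisy copies of channel 0 (3 is sign-flipped),
# while channel 2 is unrelated noise.
rng = np.random.default_rng(0)
base = rng.standard_normal(200)
series = np.stack([
    base,
    base + 0.01 * rng.standard_normal(200),
    rng.standard_normal(200),
    -base + 0.01 * rng.standard_normal(200),
])
neighbors = sparsify_relations(series, k=2)
```

In this toy setup the unrelated channel 2 is excluded from channel 0's neighbour set, which is the noise-limiting effect the abstract describes. Storing C × k indices instead of a dense C × C matrix also reflects the memory saving it mentions.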
