HDT: Hierarchical Discrete Transformer for Multivariate Time Series Forecasting

Shibo Feng, Peilin Zhao, Liu Liu, Pengcheng Wu, Zhiqi Shen

TL;DR

HDT addresses high-dimensional multivariate time series forecasting with long horizons by converting targets to discrete tokens through a two-stage vector-quantized process and applying a two-level, self-conditioned Transformer to model priors over these tokens. The low-level stage captures long-term trends from downsampled representations, while the high-level stage generates full target tokens conditioned on these trends, enabling long-range accuracy and fast inference. Empirical results across five real-world datasets show substantial improvements in probabilistic and deterministic forecasts over state-of-the-art baselines, with notable gains over VQ-TR and diffusion-based methods in high-dimensional settings. The approach offers improved scalability and practical impact for high-dimensional MTS forecasting, with future work exploring unified discretization and multimodal integration.

Abstract

Generative models have gained significant attention in multivariate time series (MTS) forecasting, particularly due to their ability to generate high-fidelity samples. Forecasting the probability distribution of multivariate time series is a challenging yet practical task. Although some recent attempts have been made to handle this task, two major challenges persist: 1) some existing generative methods underperform in high-dimensional MTS forecasting and are hard to scale to higher dimensions; 2) the inherently high-dimensional multivariate attributes constrain the forecasting lengths of existing generative models. In this paper, we point out that discrete token representations can model high-dimensional MTS with faster inference, and that forecasting the target conditioned on its own long-term trend can extend the forecasting length while maintaining high accuracy. Motivated by this, we propose a vector-quantized framework called Hierarchical Discrete Transformer (HDT) that encodes time series into discrete token representations using an l2-normalization-enhanced vector quantization strategy, thereby turning MTS forecasting into discrete token generation. To address the limitations of generative models in long-term forecasting, HDT is hierarchical: it captures the discrete long-term trend of the target at the low level and uses this trend as a condition to generate the discrete representation of the target at the high level, which introduces features of the target itself to extend the forecasting length in high-dimensional MTS. Extensive experiments on five popular MTS datasets verify the effectiveness of our proposed method.
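
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that l2-normalized vector quantization means normalizing both the latent vectors and the codebook entries before the nearest-neighbor lookup, and that the long-term trend is obtained by average pooling; the abstract does not state the downsampling operator. The function names `l2_normalize`, `quantize`, and `downsample` are hypothetical, and the toy codebook stands in for one that would be learned in stage 1.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Both sides are l2-normalized first, so the nearest entry in Euclidean
    distance is also the one with the highest cosine similarity.
    """
    codes = [l2_normalize(c) for c in codebook]
    tokens = []
    for z in latents:
        z = l2_normalize(z)
        sims = [sum(a * b for a, b in zip(z, c)) for c in codes]
        tokens.append(max(range(len(codes)), key=sims.__getitem__))
    return tokens

def downsample(series, factor):
    """Average non-overlapping windows (an assumed choice of operator)."""
    return [sum(series[i:i + factor]) / factor
            for i in range(0, len(series) - factor + 1, factor)]

# Toy data: a hand-written 3-entry codebook and three 2-d latents.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
latents = [[2.0, 0.1], [0.2, 3.0], [-5.0, 0.5]]

tokens = quantize(latents, codebook)        # discrete tokens for the target
trend = downsample([1.0, 3.0, 5.0, 7.0], 2) # coarse trend of a scalar series
```

Because both sides are normalized, only the direction of a latent decides its token, so the large-magnitude latent `[-5.0, 0.5]` is still matched to the unit-length entry `[-1.0, 0.0]`. In the described framework, the low-level model would predict tokens of the downsampled target, and the high-level model would predict the full-resolution tokens conditioned on them.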

Paper Structure

This paper contains 19 sections, 13 equations, 11 figures, 12 tables, 3 algorithms.

Figures (11)

  • Figure 1: An illustration of our proposed HDT is provided. In stage 1, the model generates discrete downsampled targets and discrete targets, which are passed to Stage 2 for further processing. In stage 2, the contextual encoder and base Transformer decoder are trained with historical inputs and discrete downsampled tokens at the low level. Once trained, these low-level modules are fixed, and their outputs are fed into the high-level framework to generate the final discrete target sequence.
  • Figure 2: Performance of HDT at different temperature levels and prediction lengths on the Traffic and Taxi datasets, and comparison of HDT against MG_TSD and VQ-TR at different missing rates.
  • Figure 3: Probabilistic and deterministic performance of HDT and HDT variants across prediction lengths and datasets. HDT-var.T has the same structure as HDT but without the self-conditions in stage 2. HDT-var.L replaces the Transformer with an LSTM in stage 2 and also omits the self-conditions.
  • Figure 4: Model memory usage and time efficiency comparison under input-96-predict-48, 96 of Traffic and Taxi, respectively.
  • Figure 5: Comparison of prediction intervals with TimeGrad and MG_TSD for the Taxi dataset, which comprises 1214 dimensions. The predicted median is displayed, along with the 50% and 90% distribution intervals. The blue line in the graph represents the ground truth of the test sample.
  • ...and 6 more figures