Generative Pretrained Hierarchical Transformer for Time Series Forecasting

Zhiding Liu, Jiqian Yang, Mingyue Cheng, Yucong Luo, Zhi Li

TL;DR

GPHT addresses the limitations of single-dataset pretraining and one-step forecasting by pretraining a unified, generative, auto-regressive transformer on a mixed dataset built under a channel-independent assumption. Its hierarchical, multi-stage encoder captures temporal patterns at multiple resolutions, while iterative residual learning refines forecasts across stages. By formulating forecasting as language-model-like next-token prediction and enabling horizon-agnostic inference, GPHT achieves strong generalization across eight datasets and exhibits notable zero-shot and few-shot performance gains. The results demonstrate that mixing diverse time series for pretraining and using a conditional autoregressive decoding strategy can yield robust, transfer-ready forecasters with practical inference benefits.
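To make the mixed-dataset idea concrete, the following is a minimal sketch of pooling several datasets under a channel-independent assumption: each channel of each multivariate dataset is treated as its own univariate series, and fixed-length windows from all of them go into one shared training pool. The function names (`split_channels`, `build_mixed_pool`), the non-overlapping windowing, and the toy data are illustrative assumptions, not the authors' preprocessing code.

```python
from typing import Dict, List, Tuple

Series = List[float]
Table = List[List[float]]  # rows = time steps, columns = channels


def split_channels(table: Table) -> List[Series]:
    """Split a multivariate table into one univariate series per channel."""
    if not table:
        return []
    n_channels = len(table[0])
    return [[row[c] for row in table] for c in range(n_channels)]


def build_mixed_pool(datasets: Dict[str, Table], window: int) -> List[Tuple[str, Series]]:
    """Pool fixed-length windows from every channel of every dataset.

    Assumption: non-overlapping windows (stride == window); a real pipeline
    would likely use a sliding stride and some form of normalization.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pool: List[Tuple[str, Series]] = []
    for name, table in datasets.items():
        for series in split_channels(table):
            for start in range(0, len(series) - window + 1, window):
                pool.append((name, series[start:start + window]))
    return pool


# Toy data (hypothetical): dataset "a" has 2 channels, dataset "b" has 1.
toy = {
    "a": [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]],
    "b": [[5.0], [6.0], [7.0], [8.0], [9.0]],
}
mixed = build_mixed_pool(toy, window=2)
```

Because every sample in the pool is univariate and has the same length, datasets with different numbers of channels can share one model and one training loop.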

Abstract

Recent efforts have been dedicated to enhancing time series forecasting accuracy by introducing advanced network architectures and self-supervised pretraining strategies. Nevertheless, existing approaches still exhibit two critical drawbacks. First, these methods often rely on a single dataset for training, which limits the model's generalizability because of the restricted scale of the training data. Second, the one-step generation schema is widely followed. It necessitates a customized forecasting head, overlooks the temporal dependencies in the output series, and increases training costs because a separate model must be trained for each horizon length. To address these issues, we propose a novel generative pretrained hierarchical transformer architecture for forecasting, named GPHT. GPHT has two key designs. On the one hand, we advocate constructing a mixed dataset under the channel-independent assumption for pretraining our model, comprising various datasets from diverse data scenarios. This approach significantly expands the scale of the training data, allowing our model to uncover commonalities in time series data and facilitating improved transfer to specific datasets. On the other hand, GPHT employs an auto-regressive forecasting approach, effectively modeling temporal dependencies in the output series. Importantly, no customized forecasting head is required, enabling a single model to forecast at arbitrary horizon settings. We conduct extensive experiments on eight datasets, comparing against mainstream self-supervised pretraining models and supervised models. The results demonstrate that GPHT surpasses the baseline models across various fine-tuning and zero/few-shot learning settings in the traditional long-term forecasting task. We make our code publicly available at https://github.com/icantnamemyself/GPHT.
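The horizon-agnostic property described above can be illustrated with a minimal sketch of auto-regressive decoding over time series tokens (patches): the model predicts one patch at a time, each prediction is appended to the context, and the loop stops once the requested horizon is covered. The names `autoregressive_forecast` and `repeat_last_patch`, and the stand-in model itself, are hypothetical placeholders for a trained network and do not reflect GPHT's actual interface.

```python
from typing import Callable, List

Series = List[float]


def autoregressive_forecast(
    history: Series,
    horizon: int,
    patch_len: int,
    predict_next_patch: Callable[[Series], Series],
) -> Series:
    """Roll a next-patch predictor forward until `horizon` steps are covered."""
    if horizon <= 0:
        return []
    context = list(history)  # copy so the caller's history is not mutated
    forecast: Series = []
    while len(forecast) < horizon:
        next_patch = predict_next_patch(context)
        if len(next_patch) != patch_len:
            raise ValueError("predictor must return exactly patch_len values")
        forecast.extend(next_patch)
        context.extend(next_patch)  # feed the prediction back as input
    return forecast[:horizon]  # trim when horizon is not a multiple of patch_len


def repeat_last_patch(patch_len: int) -> Callable[[Series], Series]:
    """Stand-in 'model' (assumption): predicts that the last patch repeats."""
    def model(context: Series) -> Series:
        return context[-patch_len:]
    return model


history = [1.0, 2.0, 3.0, 4.0]
model = repeat_last_patch(patch_len=2)
short = autoregressive_forecast(history, horizon=2, patch_len=2, predict_next_patch=model)
long = autoregressive_forecast(history, horizon=5, patch_len=2, predict_next_patch=model)
```

The same `model` object serves both horizons; only the number of decoding steps changes, which is why no horizon-specific forecasting head is needed in this scheme.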

Paper Structure (27 sections, 7 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Illustration of the proposed GPHT model with two key features: (a) GPHT forecasts in an auto-regressive manner on the time series tokens. (b) Pretrained on the mixed dataset with the multi-stage hierarchical transformer blocks, GPHT excels in capturing the commonalities among time series originating from various data scenarios.
  • Figure 2: Performance comparison of GPHT with different numbers of stages of hierarchical transformer blocks.
  • Figure 3: MAE comparison between GPHT and GPHT without pretraining on benchmark datasets.
  • Figure 4: Visualization of the input and corresponding output series of GPHT's multiple stages on a sample from the ETTh1 dataset.
  • Figure 5: Forecasting showcases comparing GPHT and baseline models. The lookback window is set to 336, and the forecasting horizon is set to 336, 192, and 96 for the Exchange, Traffic, and ETTh1 datasets, respectively.