AutoHFormer: Efficient Hierarchical Autoregressive Transformer for Time Series Prediction
Qianru Zhang, Honggang Wen, Ming Li, Dong Huang, Siu-Ming Yiu, Christian S. Jensen, Pietro Liò
TL;DR
AutoHFormer tackles the triad of strict causality, computational efficiency, and multi-scale forecasting in time-series data by introducing a hierarchical autoregressive Transformer with Dynamic Windowed Masked Attention and adaptive temporal encoding. It achieves sub-quadratic complexity $O(LW)$ through segment-level parallel prediction and intra-segment refinement, while preserving temporal coherence via learnable causal windows and exponential decay. Relative position encodings and PPEs further enhance temporal awareness, enabling robust long-horizon forecasting (e.g., up to $T=720$ steps) with reduced memory and faster training compared to baselines like PatchTST. Across diverse datasets, AutoHFormer demonstrates state-of-the-art accuracy, strong robustness to noise, and superior scalability, making it a practical solution for real-world forecasting in energy, traffic, and related domains.
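To make the $O(LW)$ claim concrete, the following minimal sketch counts the query-key pairs evaluated under full causal attention versus a causal window of width $W$. The function names and the values `L = 720` and `W = 32` are illustrative assumptions, not settings taken from the paper.

```python
def causal_pairs(L: int) -> int:
    """Query-key pairs under full causal attention: step t sees steps 0..t."""
    return L * (L + 1) // 2


def windowed_causal_pairs(L: int, W: int) -> int:
    """Query-key pairs when step t sees only itself and up to W-1 earlier steps."""
    return sum(min(t + 1, W) for t in range(L))


# Illustrative values only (assumed, not from the paper).
L, W = 720, 32
full = causal_pairs(L)
windowed = windowed_causal_pairs(L, W)
print(f"full causal: {full} pairs, windowed: {windowed} pairs")
```

The windowed count is bounded above by $L \cdot W$, so it grows linearly in $L$ for a fixed window, whereas the full causal count grows quadratically.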
Abstract
Time series forecasting requires architectures that simultaneously achieve three competing objectives: (1) strict temporal causality for reliable predictions, (2) sub-quadratic complexity for practical scalability, and (3) multi-scale pattern recognition for accurate long-horizon forecasting. We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges through three key innovations: 1) Hierarchical Temporal Modeling: Our architecture decomposes predictions into segment-level blocks processed in parallel, followed by intra-segment sequential refinement. This dual-scale approach maintains temporal coherence while enabling efficient computation. 2) Dynamic Windowed Attention: The attention mechanism employs learnable causal windows with exponential decay, reducing complexity while preserving precise temporal relationships. This design avoids both the anti-causal violations of standard transformers and the sequential bottlenecks of RNN hybrids. 3) Adaptive Temporal Encoding: A novel position encoding system captures temporal patterns at multiple scales. It combines fixed oscillating patterns for short-term variations with learnable decay rates for long-term trends. Comprehensive experiments demonstrate that AutoHFormer achieves 10.76× faster training and a 6.06× memory reduction compared to PatchTST on PEMS08, while maintaining consistent accuracy across 96-720 step horizons in most cases. These results establish new benchmarks for efficient and precise time series modeling. Implementations of our method and of all baselines under the hierarchical autoregressive mechanism are available at https://github.com/lizzyhku/Autotime.
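The idea behind causal windowed attention with exponential decay can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `windowed_causal_attention` is hypothetical, `window` and `decay` are fixed scalars here (the abstract describes them as learnable), and the decay is applied as a linear penalty on the logits, which corresponds to an exponential down-weighting of older steps before normalization. For readability the sketch builds the dense $L \times L$ score matrix and masks it; an efficient version would compute only the $W$ entries inside each window.

```python
import numpy as np


def windowed_causal_attention(q, k, v, window, decay):
    """Single-head attention where query t sees only keys s with 0 <= t - s < window.

    q, k, v: arrays of shape (L, d).
    window:  number of most recent steps (including the current one) visible to a query.
    decay:   non-negative rate; logits are reduced by decay * (t - s), i.e. the
             unnormalized weights are multiplied by exp(-decay * lag).
    Returns (output of shape (L, d), attention weights of shape (L, L)).
    """
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)                    # scaled dot-product logits

    t = np.arange(L)[:, None]                        # query positions
    s = np.arange(L)[None, :]                        # key positions
    lag = t - s                                      # how far back each key lies

    allowed = (lag >= 0) & (lag < window)            # causal and inside the window
    scores = scores - decay * np.maximum(lag, 0)     # penalize older steps
    scores = np.where(allowed, scores, -np.inf)      # mask everything else

    # Numerically stable softmax; the diagonal is always allowed, so each row
    # has at least one finite entry.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 10, 4))                # toy inputs: L = 10, d = 4
out, w = windowed_causal_attention(q, k, v, window=3, decay=0.5)
print(out.shape, np.count_nonzero(w[5]))             # row 5 attends to 3 keys
```

Because every weight above the diagonal is exactly zero, the output at step $t$ cannot depend on inputs after $t$, which is the strict-causality property the abstract emphasizes.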
