AutoHFormer: Efficient Hierarchical Autoregressive Transformer for Time Series Prediction

Qianru Zhang, Honggang Wen, Ming Li, Dong Huang, Siu-Ming Yiu, Christian S. Jensen, Pietro Liò

TL;DR

AutoHFormer tackles the triad of strict causality, computational efficiency, and multi-scale forecasting in time-series data by introducing a hierarchical autoregressive Transformer with Dynamic Windowed Masked Attention and adaptive temporal encoding. It achieves sub-quadratic complexity $O(LW)$ through segment-level parallel prediction and intra-segment refinement, while preserving temporal coherence via learnable causal windows and exponential decay. Relative position encodings and PPEs further enhance temporal awareness, enabling robust long-horizon forecasting (e.g., up to $T=720$ steps) with reduced memory and faster training compared to baselines like PatchTST. Across diverse datasets, AutoHFormer demonstrates state-of-the-art accuracy, strong robustness to noise, and superior scalability, making it a practical solution for real-world forecasting in energy, traffic, and related domains.

Abstract

Time series forecasting requires architectures that simultaneously achieve three competing objectives: (1) strict temporal causality for reliable predictions, (2) sub-quadratic complexity for practical scalability, and (3) multi-scale pattern recognition for accurate long-horizon forecasting. We introduce AutoHFormer, a hierarchical autoregressive transformer that addresses these challenges through three key innovations: 1) Hierarchical Temporal Modeling: Our architecture decomposes predictions into segment-level blocks processed in parallel, followed by intra-segment sequential refinement. This dual-scale approach maintains temporal coherence while enabling efficient computation. 2) Dynamic Windowed Attention: The attention mechanism employs learnable causal windows with exponential decay, reducing complexity while preserving precise temporal relationships. This design avoids both the anti-causal violations of standard transformers and the sequential bottlenecks of RNN hybrids. 3) Adaptive Temporal Encoding: A novel position encoding system captures time patterns at multiple scales. It combines fixed oscillating patterns for short-term variations with learnable decay rates for long-term trends. Comprehensive experiments demonstrate that AutoHFormer achieves 10.76X faster training and a 6.06X memory reduction compared to PatchTST on PEMS08, while maintaining consistent accuracy across 96-720 step horizons in most cases. These results establish new benchmarks for efficient and precise time series modeling. Implementations of our method and of all baselines under the hierarchical autoregressive mechanism are available at https://github.com/lizzyhku/Autotime.
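
To make the sub-quadratic claim concrete, the following minimal sketch counts how many (query, key) pairs a causal attention layer evaluates with and without a fixed window. It is an illustration only, not the authors' implementation: the function names, the convention that the window includes the current step, and the window size `W = 32` are assumptions introduced here.

```python
def causal_pairs(L):
    """Pairs evaluated by full causal attention: query t sees keys 0..t."""
    return L * (L + 1) // 2


def windowed_causal_pairs(L, W):
    """Pairs evaluated when query t sees only keys max(0, t-W+1)..t.

    Assumption: the window of size W includes the current step t.
    """
    return sum(min(t + 1, W) for t in range(L))


# Illustrative sequence lengths; W = 32 is an assumed window size.
for L in (96, 192, 336, 720):
    print(L, causal_pairs(L), windowed_causal_pairs(L, W=32))
```

The windowed count never exceeds $L \cdot W$, which is the $O(LW)$ bound quoted in the TL;DR. For $L = 720$ and $W = 32$ the sketch gives 22,544 pairs against 259,560 for full causal attention, roughly 11.5 times fewer. This is a count of attention pairs under the stated assumptions, not the measured training speed-up reported in the abstract.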

Paper Structure

This paper contains 31 sections, 4 theorems, 14 equations, 9 figures, 8 tables, 1 algorithm.

Key Result

Theorem II.1

The Dynamic Windowed Masked Attention (DWMA) mechanism converges to an optimal attention distribution as the sequence length $L \to \infty$, provided the time decay factor $\gamma$ is chosen appropriately.

Figures (9)

  • Figure 1: Architectural comparison of time series modeling approaches: (a) Conventional Transformer suffers from anti-causal attention flows (red dashed arrows) that violate the fundamental autoregressive principle $p(x_t|x_{<t})$. (b) RNN-Transformer Hybrid enforces causality through sequential processing (orange solid arrows) but introduces an $\mathcal{O}(L)$ computational bottleneck that prevents parallel training. (c) AutoHFormer (ours) combines: ① Strictly causal attention within a sliding window $W$ (blue shaded region), ② Exponentially decaying attention weights $\tau(t,t')=e^{-|t-t'|/\gamma}$ (visualized by arrow opacity gradient), and ③ $\mathcal{O}(LW)$ complexity through windowed parallel processing. The thickness of blue arrows represents attention magnitude, demonstrating our model's ability to maintain temporal causality while enabling efficient parallel computation.
  • Figure 2: Overview of AutoHFormer. The left part is the input time series data and the right part is the prediction. In the middle, (a) the top component is the hierarchical autoregressive mechanism, comprising a segment-level part and a step-wise part; (b) the bottom component contains three key elements: (1) a dynamic windowed attention mechanism that computes localized attention patterns with adaptive window sizing, (2) precomputed relative position encodings that capture temporal relationships through sinusoidal embeddings, and (3) a stack of sub-models (efficient transformers) with layer normalization that progressively refine feature representations while maintaining strict causality. The complete system transforms raw input sequences into predictions through this transformer-based pipeline, achieving both computational efficiency and modeling precision.
  • Figure 3: Visualization of Attention Weights in AutoHFormer
  • Figure 4: Scalability study when batch size is set as 32.
  • Figure 5: Performance comparison of robustness on ETTm1
  • ...and 4 more figures
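
The Figure 1 caption describes strictly causal attention inside a sliding window $W$, weighted by $\tau(t,t')=e^{-|t-t'|/\gamma}$. A minimal sketch of how such weights could be computed is given below. It is not the paper's DWMA implementation: the function name `windowed_decay_attention`, the fixed (non-learned) window, and the choice to apply the decay as an additive bias on the logits (equivalent to multiplying the unnormalized weights by $\tau$) are all assumptions made for illustration.

```python
import math


def windowed_decay_attention(scores, window, gamma):
    """Row-wise softmax restricted to a causal window, with time decay.

    scores[t][s] is a raw similarity between query step t and key step s.
    Assumptions (not taken from the paper): the window is fixed, includes
    the current step, and the decay tau = exp(-(t - s) / gamma) enters as
    an additive bias -(t - s) / gamma on the logits.
    """
    L = len(scores)
    weights = []
    for t in range(L):
        lo = max(0, t - window + 1)          # oldest key still in the window
        keys = range(lo, t + 1)              # never includes future steps s > t
        logits = [scores[t][s] - (t - s) / gamma for s in keys]
        peak = max(logits)                   # subtract max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        row = [0.0] * L                      # keys outside the window get weight 0
        for s, e in zip(keys, exps):
            row[s] = e / total
        weights.append(row)
    return weights


# Uniform scores isolate the effect of the window and the decay.
uniform = [[0.0] * 6 for _ in range(6)]
attn = windowed_decay_attention(uniform, window=3, gamma=2.0)
for row in attn:
    print([round(w, 3) for w in row])
```

With uniform scores, each row places its largest weight on the current step and progressively smaller weights on older steps inside the window, while future steps and steps beyond the window receive exactly zero. Each row touches at most `window` keys, which matches the $\mathcal{O}(LW)$ cost stated in the caption.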

Theorems & Definitions (8)

  • Theorem II.1
  • proof
  • Theorem II.2
  • proof
  • Theorem II.3
  • proof
  • Theorem II.4: Generalization Error Bound, Extended
  • proof: Extended Proof