Powerformer: A Transformer with Weighted Causal Attention for Time-series Forecasting
Kareem Hegazy, Michael W. Mahoney, N. Benjamin Erichson
TL;DR
Powerformer addresses the mismatch between all-to-all Transformer attention and the causal structure of time series by introducing Weighted Causal Multihead Attention (WCMHA), which combines a causal mask with a decaying locality mask to favor temporally local dependencies. The architecture is encoder-only, uses PatchTST-style patching, and applies WCMHA to encoder self-attention (not to decoder cross-attention). This keeps the model simple and interpretable, and it opens the possibility of linear-time attention when the decaying weights are truncated at a cutoff. Empirically, Powerformer achieves state-of-the-art results across seven public datasets. Ablations show that power-law-based locality biases (PL/SPL) outperform Butterworth-based schemes and that the learned biases act as regularizers guiding temporal dependencies. The paper also provides interpretability evidence, showing how the induced locality shapes attention patterns and reveals dataset-specific temporal structure. In practice, Powerformer offers a strong, efficient, and principled baseline for time-series forecasting that pairs a domain-specific inductive bias with the Transformer architecture.
Abstract
Transformers have recently shown strong performance in time-series forecasting, but their all-to-all attention mechanism overlooks the (temporal) causal and often (temporally) local nature of data. We introduce Powerformer, a novel Transformer variant that replaces noncausal attention weights with causal weights that are reweighted according to a smooth heavy-tailed decay. This simple yet effective modification endows the model with an inductive bias favoring temporally local dependencies, while still allowing sufficient flexibility to learn the unique correlation structure of each dataset. Our empirical results demonstrate that Powerformer not only achieves state-of-the-art accuracy on public time-series benchmarks, but also that it offers improved interpretability of attention patterns. Our analyses show that the model's locality bias is amplified during training, demonstrating an interplay between time-series data and power-law-based attention. These findings highlight the importance of domain-specific modifications to the Transformer architecture for time-series forecasting, and they establish Powerformer as a strong, efficient, and principled baseline for future research and real-world applications.
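To make the mechanism concrete, the following is a minimal single-head NumPy sketch of causal attention reweighted by a power-law decay. It is an illustration under stated assumptions, not the authors' implementation: the function names, the decay exponent `alpha`, and the `1 + delay` shift (used here so that the zero-delay term stays finite) are all hypothetical choices. The decay is added to the attention scores before the softmax, so each causal weight is multiplied by `(1 + delay) ** -alpha` before normalization.

```python
import numpy as np


def power_law_causal_mask(seq_len, alpha):
    """Additive mask: -alpha * log(1 + delay) for past/present keys, -inf for future keys.

    The 1 + delay shift is an assumption of this sketch, not taken from the paper.
    """
    idx = np.arange(seq_len)
    delay = idx[:, None] - idx[None, :]      # delay[i, j] = i - j
    causal = delay >= 0                      # query i may only attend to keys j <= i
    safe_delay = np.where(causal, delay, 0)  # avoid log of non-positive values
    return np.where(causal, -alpha * np.log1p(safe_delay), -np.inf)


def weighted_causal_attention(q, k, v, alpha=1.0):
    """Single-head attention with a causal mask and a power-law locality bias."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores + power_law_causal_mask(q.shape[0], alpha)
    # Numerically stable softmax; the diagonal is always finite, so each row max is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out, weights = weighted_causal_attention(q, k, v, alpha=1.0)
```

With identical query-key similarities, the resulting weights in each row decay purely as a power law of the delay, which is the locality bias described above. A larger `alpha` concentrates attention on recent time steps, while `alpha = 0` recovers ordinary causal attention.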
