Powerformer: A Transformer with Weighted Causal Attention for Time-series Forecasting

Kareem Hegazy, Michael W. Mahoney, N. Benjamin Erichson

TL;DR

Powerformer addresses the mismatch between Transformer attention and time-series causality by introducing Weighted Causal Multihead Attention (WCMHA), which combines a causal mask with a decaying locality mask to favor temporally local dependencies. The encoder-only Powerformer architecture employs PatchTST-style patching and applies WCMHA to encoder self-attention (excluding decoder cross-attention), enabling a simple, interpretable model with potential linear-time attention up to a cutoff. Empirically, Powerformer achieves state-of-the-art results across seven public datasets, with ablations showing that power-law-based locality biases (PL/SPL) outperform Butterworth-based schemes and that the learned biases act as regularizers guiding temporal dependencies. The paper also provides interpretability evidence, showing how the induced locality shapes attention patterns and enhances understanding of dataset-specific temporal structure, offering a principled baseline for time-series forecasting. The practical impact lies in delivering a strong, efficient baseline that combines domain-specific inductive biases with Transformer architecture, improving forecasting accuracy while enabling clearer insights into temporal dependencies and potential speedups via a controllable cutoff.

Abstract

Transformers have recently shown strong performance in time-series forecasting, but their all-to-all attention mechanism overlooks the (temporal) causal and often (temporally) local nature of data. We introduce Powerformer, a novel Transformer variant that replaces noncausal attention weights with causal weights that are reweighted according to a smooth heavy-tailed decay. This simple yet effective modification endows the model with an inductive bias favoring temporally local dependencies, while still allowing sufficient flexibility to learn the unique correlation structure of each dataset. Our empirical results demonstrate that Powerformer not only achieves state-of-the-art accuracy on public time-series benchmarks, but also that it offers improved interpretability of attention patterns. Our analyses show that the model's locality bias is amplified during training, demonstrating an interplay between time-series data and power-law-based attention. These findings highlight the importance of domain-specific modifications to the Transformer architecture for time-series forecasting, and they establish Powerformer as a strong, efficient, and principled baseline for future research and real-world applications.
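
The reweighting described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names `power_law_causal_mask` and `weighted_causal_attention` are hypothetical, and the mask form $-\alpha \log(1 + \Delta t)$ is an assumption chosen so that the zero-delay (diagonal) entry stays finite. Future positions receive $-\infty$, which enforces causality after the softmax.

```python
import numpy as np

def power_law_causal_mask(seq_len, alpha):
    """Additive score mask (illustrative): -alpha * log(1 + dt) for past and
    current keys, -inf for future keys. The 1 + dt offset is an assumption."""
    idx = np.arange(seq_len)
    dt = idx[:, None] - idx[None, :]          # query index minus key index
    mask = np.full((seq_len, seq_len), -np.inf)
    past = dt >= 0                            # causal part: key not after query
    mask[past] = -alpha * np.log1p(dt[past])  # heavy-tailed (power-law) decay
    return mask

def weighted_causal_attention(q, k, v, alpha):
    """Single-head attention with the causal power-law mask added to the scores."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores + power_law_causal_mask(q.shape[0], alpha)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                  # exp(-inf) = 0 for future keys
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage: 8 time steps (or patches) with 16-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
out, attn = weighted_causal_attention(x, x, x, alpha=1.0)
```

With identical scores across keys, each row of `attn` decays as $(1 + \Delta t)^{-\alpha}$ toward older positions, so a larger $\alpha$ gives a stronger locality bias, while the learned scores can still override the decay where the data calls for it.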

Paper Structure

This paper contains 38 sections, 14 equations, 55 figures, 10 tables.

Figures (55)

  • Figure 1: Illustration of Powerformer and the Weighted Causal Multihead Attention (WCMHA) architecture, as well as their effects on attention weights. Panel (a) shows the Powerformer architecture (left) and the WCMHA (right). Panels (b) and (c) show the attention weights without and with our local-causal mask, respectively. Here, $\Sigma$ corresponds to the softmax function.
  • Figure 2: We show the weight power-law (solid line) and similarity power-law (dashed line) masks for varying $\alpha$. Panel (a) shows the contribution added to the attention scores and Panel (b) shows the subsequent effects on the attention weights after applying Softmax.
  • Figure 3: We show the attention score and weight distributions for both the benchmark Transformer (dotted black line) with MHA and our modified Transformer with WCMHA and $f^{(\text{PL})}(t)$ (solid colored lines). Panels (a), (b), and (c) correspond to the last encoder self-attention, decoder self-attention, and decoder cross-attention layers, respectively. The colored lines correspond to different mask decay times $(\alpha)$. These results are from the Electricity dataset with a 96 prediction length and 512 input length.
  • Figure 4: We show Powerformer's attention score and weight distributions with MHA (dotted line) and with WCMHA (solid lines) for $f^{(\text{PL})}(t)$. The colored lines correspond to different mask decay times $(\alpha)$. These results are for the Weather dataset with a 96 prediction length and 512 input length.
  • Figure 5: We show the causal and local biases' implicit and explicit effects on Powerformer's attention score and weight distributions. The reference (dotted line) has no mask (MHA), the solid line uses WCMHA and has the mask applied, and the dashed-dotted line is the WCMHA distribution calculated before applying the mask. These results are for the Weather dataset with a 96 prediction length and 512 input length.
  • ...and 50 more figures
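
The relationship between the two panels of Figure 2 (a contribution added to the attention scores, and its effect on the weights after Softmax) follows from a general property of the softmax. The short derivation below is a sketch in illustrative notation that does not come from the captions: $s_{ij}$ denotes the unmasked score between query $i$ and key $j$, and $\Delta t_{ij} > 0$ their temporal delay. The logarithmic form of $f^{(\text{PL})}$ is an assumption suggested by the name "weight power-law".

$$
w_{ij} = \frac{\exp\big(s_{ij} + f(\Delta t_{ij})\big)}{\sum_{k \le i} \exp\big(s_{ik} + f(\Delta t_{ik})\big)}
\quad\Longrightarrow\quad
w_{ij} \propto e^{s_{ij}} \, e^{f(\Delta t_{ij})}
$$

$$
f^{(\text{PL})}(\Delta t) = -\alpha \log \Delta t
\quad\Longrightarrow\quad
w_{ij} \propto e^{s_{ij}} \, (\Delta t_{ij})^{-\alpha}
$$

Under this assumption, an additive logarithmic mask on the scores becomes a multiplicative power-law decay on the weights, which is why the same $\alpha$ looks different in the two panels.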