Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, Xifeng Yan

TL;DR

This paper tackles time series forecasting by adapting Transformer architectures to address locality-insensitivity and memory bottlenecks. It introduces convolutional self-attention to inject local context and a LogSparse Transformer that achieves $O(L(\log L)^{2})$ memory, enabling fine-grained, long-horizon modeling under memory constraints. Through synthetic and real-world experiments, the approach demonstrates improved long-term dependency capture and competitive performance against state-of-the-art baselines, particularly in data with strong seasonal patterns. The results suggest that locality-aware and memory-efficient attention mechanisms can significantly enhance forecasting accuracy in practical, large-scale time-series tasks.

Abstract

Time series forecasting is an important problem across many domains, including prediction of solar plant energy output, electricity consumption, and traffic congestion. In this paper, we propose to tackle this forecasting problem with Transformer [1]. Although impressed by its performance in our preliminary study, we found two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in the canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: the space complexity of the canonical Transformer grows quadratically with sequence length $L$, making it infeasible to model long time series directly. To solve these two issues, we first propose convolutional self-attention, which produces queries and keys with causal convolution so that local context is better incorporated into the attention mechanism. Then, we propose the LogSparse Transformer with only $O(L(\log L)^{2})$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under a constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
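The convolutional self-attention described in the abstract can be illustrated with a minimal single-head NumPy sketch. It assumes that queries and keys come from a causal convolution of kernel size $k$ (left zero-padding, stride 1), that values use a point-wise projection, and that a standard causal mask is applied. The function names, weight shapes, and the choice of a point-wise value projection are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def causal_conv1d(x, w):
    """Causal 1-D convolution with stride 1.

    x: (L, d_in) input sequence; w: (k, d_in, d_out) kernel (hypothetical layout).
    Output position t depends only on x[t-k+1 .. t], thanks to left zero-padding.
    """
    k, _, d_out = w.shape
    L = x.shape[0]
    x_pad = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    out = np.zeros((L, d_out))
    for i in range(k):
        # Tap i multiplies the input shifted by (k - 1 - i) steps into the past.
        out += x_pad[i:i + L] @ w[i]
    return out


def conv_self_attention(x, w_q, w_k, w_v):
    """Single-head attention whose queries/keys see a local window of size k."""
    q = causal_conv1d(x, w_q)        # locality-aware queries
    keys = causal_conv1d(x, w_k)     # locality-aware keys
    v = x @ w_v                      # point-wise values (assumption)
    scores = q @ keys.T / np.sqrt(q.shape[1])
    L = x.shape[0]
    future = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # no attention to the future
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


# Usage: a random sequence of length 12 with 4 features and kernel size 3.
rng = np.random.default_rng(0)
L, d, k = 12, 4, 3
x = rng.normal(size=(L, d))
w_q = rng.normal(size=(k, d, d))
w_k = rng.normal(size=(k, d, d))
w_v = rng.normal(size=(d, d))
out, weights = conv_self_attention(x, w_q, w_k, w_v)
```

With $k = 1$ the convolution reduces to a point-wise projection, so this sketch falls back to canonical causal self-attention; larger $k$ lets each query and key summarize the local shape of the series rather than a single point.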

Paper Structure

This paper contains 24 sections, 1 theorem, 6 equations, 6 figures, 5 tables.

Key Result

Theorem 1

$\forall l$ and $j\leq l$, there is at least one path from cell $j$ to cell $l$ if we stack $\left\lfloor\log_2l\right\rfloor+1$ layers. Moreover, for $j<l$, the number of feasible unique paths from cell $j$ to cell $l$ increases at a rate of $O(\left\lfloor \log_2 (l-j)\right\rfloor!)$.
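The reachability claim in Theorem 1 can be checked numerically with a small sketch. It assumes a LogSparse pattern in which each cell $l$ attends to itself and to the earlier cells $l-2^{0}, l-2^{1}, l-2^{2}, \dots$ that exist (1-indexed); this summary does not spell the pattern out, so treat that definition and the helper names below as assumptions. The code only checks the "at least one path" part of the theorem, not the path-count growth rate.

```python
def logsparse_indices(l):
    """Cells that cell l attends to in one layer (1-indexed, assumed pattern)."""
    idx = {l}
    step = 1
    while l - step >= 1:
        idx.add(l - step)   # earlier cell at distance 2^0, 2^1, 2^2, ...
        step *= 2
    return idx


def sources_reaching(l, num_layers):
    """Cells whose information can reach cell l after stacking num_layers layers."""
    reached = {l}
    for _ in range(num_layers):
        reached = set().union(*(logsparse_indices(m) for m in reached))
    return reached


def layers_in_theorem(l):
    """floor(log2(l)) + 1, computed exactly with integer arithmetic."""
    return l.bit_length()


# Usage: cell 16 attends to only 5 cells per layer, yet 5 layers cover cells 1..16.
print(sorted(logsparse_indices(16)))                       # [8, 12, 14, 15, 16]
print(sources_reaching(16, layers_in_theorem(16)) == set(range(1, 17)))  # True
```

Under this assumed pattern each cell attends to $O(\log L)$ cells per layer and $O(\log L)$ layers suffice for full coverage, which is consistent with the $O(L(\log L)^{2})$ memory cost stated in the abstract.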

Figures (6)

  • Figure 1: Comparison between canonical and our convolutional self-attention layers. "Conv, 1" and "Conv, $k$" denote convolution with kernel size 1 and $k$, respectively, both with stride 1. Canonical self-attention as used in Transformer, shown in (b), may wrongly match point-wise inputs, as shown in (a). Convolutional self-attention, shown in (d), uses convolutional layers of kernel size $k$ with stride 1 to transform inputs (with proper padding) into queries/keys. Such locality awareness can correctly match the most relevant features based on shape matching, as shown in (c).
  • Figure 2: Learned attention patterns from a 10-layer canonical Transformer trained on the traffic-f dataset with full attention. The green dashed line indicates the start time of forecasting and the gray dashed line on its left side is the conditional history. Blue, cyan and red lines correspond to attention patterns in layers 2, 6 and 10, respectively, for a head when predicting the value at the time corresponding to the green dashed line. a) Layer 2 tends to learn patterns shared across every day. b) Layer 6 focuses more on weekend patterns. c) Layer 10 further concentrates most of its attention on only a few cells in weekends, causing most of the others to receive little attention.
  • Figure 3: Illustration of different attention mechanisms between adjacent layers in Transformer.
  • Figure 4: (a) An example time series with $t_{0} = 96$. The black line is the conditional history while the red dashed line is the target. (b) Performance comparison between DeepAR and canonical Transformer as $t_{0}$ grows. The larger $t_{0}$ is, the longer the dependencies the models need to capture for accurate forecasting.
  • Figure 5: Training curve comparison (with proper smoothing) among kernel sizes $k \in \{1, 3, 9\}$ on the traffic-c (left) and electricity-c (right) datasets. Being aware of a larger local context, the model can achieve lower training error and converge faster.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 1
  • proof