RAM: Replace Attention with MLP for Efficient Multivariate Time Series Forecasting

Suhan Guo, Jiahong Deng, Yi Wei, Hui Dou, Furao Shen, Jian Zhao

TL;DR

The paper addresses the high computational cost of attention in multivariate time series forecasting by proposing RAM, a pruning method that replaces attention with an MLP-based structure. RAM shows that the Q, K, and V projections and the attention mapping can be removed at a small accuracy cost: FLOPs fall by 62.579% for spatio-temporal models with less than 2.5% performance drop, and by 42.233% for long-term forecasting models with less than 2% performance drop. The paper also introduces an abstract AMTSFM framework and shows that the encoder and decoder attention modules are not equally critical, with the feedforward and residual components driving the performance of the resulting MLP. The approach has practical implications for deploying forecasting models on resource-constrained devices and raises the broader question of whether attention is necessary in time-series models across domains.

Abstract

Attention-based architectures have become ubiquitous in time series forecasting tasks, including spatio-temporal forecasting (STF) and long-term time series forecasting (LTSF). Yet our understanding of the reasons for their effectiveness remains limited. In this work, we propose a novel pruning strategy, $\textbf{R}$eplace $\textbf{A}$ttention with $\textbf{M}$LP (RAM), that approximates the attention mechanism using only feedforward layers, residual connections, and layer normalization for temporal and/or spatial modeling in multivariate time series forecasting. Specifically, the Q, K, and V projections, the attention score calculation, the product of the attention scores with V, and the final projection can be removed from attention-based networks without significantly degrading performance, so that the pruned network remains top-tier compared with other SOTA methods. RAM achieves a $62.579\%$ reduction in FLOPs for spatio-temporal models with less than $2.5\%$ performance drop, and a $42.233\%$ FLOPs reduction for LTSF models with less than $2\%$ performance drop.
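The pruning described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head, post-norm block layout, the ReLU feedforward, and every name in it (`attention_block`, `mlp_block`, `init_params`, the two `*_macs` helpers) are assumptions made for illustration. The attention block applies the Q, K, V projections, the score calculation, the product with V, and the output projection before its feedforward sublayer; the pruned block keeps only the feedforward layer, the residual connection, and layer normalization.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each time step over the feature axis (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def feedforward(x, p):
    # Two-layer feedforward sublayer with ReLU (assumed activation).
    return np.maximum(x @ p["W1"] + p["b1"], 0.0) @ p["W2"] + p["b2"]

def attention_block(x, p):
    # Assumed single-head, post-norm layout: attention sublayer, then feedforward.
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    x = layer_norm(x + (scores @ v) @ p["Wo"])
    return layer_norm(x + feedforward(x, p))

def mlp_block(x, p):
    # Pruned block: Q/K/V projections, scores, weighted sum, and output
    # projection are gone; feedforward, residual, and layer norm remain.
    return layer_norm(x + feedforward(x, p))

def attention_block_macs(T, d_model, d_ff):
    # Rough multiply-accumulate count: 4 projections, scores, weighted sum, feedforward.
    return 4 * T * d_model**2 + 2 * T**2 * d_model + 2 * T * d_model * d_ff

def mlp_block_macs(T, d_model, d_ff):
    # Only the two feedforward matrix products remain.
    return 2 * T * d_model * d_ff

def init_params(d_model, d_ff, seed=0):
    # Hypothetical random weights, only to make the sketch runnable.
    rng = np.random.default_rng(seed)
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "Wo": (d_model, d_model),
              "W1": (d_model, d_ff), "b1": (d_ff,),
              "W2": (d_ff, d_model), "b2": (d_model,)}
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

T, d_model, d_ff = 12, 16, 32          # illustrative sizes, not from the paper
params = init_params(d_model, d_ff)
x = np.random.default_rng(1).normal(size=(T, d_model))

y_attn = attention_block(x, params)    # shape (T, d_model)
y_mlp = mlp_block(x, params)           # same shape, fewer operations
saved = 1 - mlp_block_macs(T, d_model, d_ff) / attention_block_macs(T, d_model, d_ff)
```

The `saved` fraction depends entirely on the toy sizes chosen here and is not expected to match the FLOPs reductions reported in the abstract, which are measured on full models.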

Paper Structure

This paper contains 27 sections, 1 theorem, 20 equations, 4 figures, 9 tables.

Key Result

Lemma 3.1

Softmax is invariant to uniform distribution input. Let $\beta \geq 0$ be the temperature and let $\{S_1, \dots, S_T\} \in \mathbb{R}^{T}$ be the collection of $T$ logits given as input to the softmax function. Bounding the input logits by some $L, U \in \mathbb{R}$, we have $L \leq S_i \leq U,
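The lemma statement is cut off in this copy, so the following is only a numerical illustration of the setting it describes, not a reconstruction of its conclusion. It checks two elementary facts that follow directly from the softmax definition: equal logits give uniform weights $1/T$, and logits bounded in $[L, U]$ give weights between $e^{\beta(L-U)}/T$ and $e^{\beta(U-L)}/T$. The helper `softmax` and the chosen values of `T`, `L`, `U`, and `beta` are assumptions made for the sketch.

```python
import numpy as np

def softmax(logits, beta=1.0):
    # Temperature-scaled softmax: weights proportional to exp(beta * S_i).
    z = beta * np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())            # shift by the max for numerical stability
    return e / e.sum()

T = 8                                   # number of logits (illustrative)

# Equal logits: every weight equals 1/T, whatever the common value or beta.
uniform = softmax(np.full(T, 3.7), beta=2.0)

# Bounded logits: each weight lies within elementary bounds around 1/T.
L, U, beta = -0.5, 0.5, 2.0             # illustrative bounds and temperature
rng = np.random.default_rng(0)
weights = softmax(rng.uniform(L, U, size=T), beta=beta)
lower = np.exp(beta * (L - U)) / T
upper = np.exp(beta * (U - L)) / T
within_bounds = bool(np.all((weights >= lower) & (weights <= upper)))
```

As the spread $U - L$ or the temperature $\beta$ shrinks, both bounds approach $1/T$, so the attention weights approach a plain average over the $T$ positions.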

Figures (4)

  • Figure 1: GCN as a modified attention mechanism in code implementation.
  • Figure 2: Comparison between Adaptive Masking (Localization), Patching, and Attention Pruning strategies. For attention pruning, we remove the attention layer from the attention block, transforming it into an MLP block with feedforward, residual connection, and layer normalization layers.
  • Figure 3: Performance comparison of MLP layer replacement for spatial (SP) and temporal (TM) attention layers on PEMS04, 07, and 08 datasets. ‘Origin’ represents the original model, and ‘TM+SP’ denotes the model with both spatial and temporal attention layers replaced by MLP layers.
  • Figure 4: Performance comparison of MLP layer replacement for encoder (EN:TM+SP) and decoder (DE:TM+SP) attention layers on PEMS04, 07, and 08 datasets. ‘Origin’ represents the original model, and ‘EN+DE:TM+SP’ denotes the model with both encoder and decoder spatial and temporal attention layers replaced by MLP layers.

Theorems & Definitions (2)

  • Lemma 3.1
  • proof