Attention as an RNN

Leo Feng; Frederick Tung; Hossein Hajimirsadeghi; Mohamed Osama Ahmed; Yoshua Bengio; Greg Mori

TL;DR

This work tackles the high inference-time cost of Transformer attention by reframing attention as a recurrent process. It introduces a mechanism based on the parallel prefix scan algorithm to compute attention as a many-to-many RNN, and then proposes Aaren, an attention module that can be trained in parallel yet updated online with constant memory. Across 38 datasets spanning reinforcement learning, event forecasting, time series classification, and time series forecasting, Aarens achieve performance comparable to Transformers while being more time- and memory-efficient. The approach offers a practical path toward deploying sequence models in low-resource and streaming settings, such as mobile and embedded devices.

Abstract

The advent of Transformers marked a significant breakthrough in sequence modelling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., mobile and embedded devices). Addressing this, we (1) begin by showing that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently. We then (2) show that popular attention-based models such as Transformers can be viewed as RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models cannot be updated efficiently with new tokens, an important property in sequence modelling. Tackling this, we (3) introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (4) introduce Aaren, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inference (like traditional RNNs). Empirically, we show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting tasks, while being more time- and memory-efficient.
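The many-to-one view in point (1) can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single query whose attention scores are already computed, and the names `softmax_attention`, `RecurrentAttentionState`, `max_score`, `numerator`, and `denominator` are hypothetical. The state keeps only a running maximum score (for numerical stability), a running weighted sum of values, and a running normaliser, so memory stays constant as tokens arrive.

```python
import math


def softmax_attention(scores, values):
    """Conventional attention for one query: sum_i softmax(scores)_i * values[i]."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


class RecurrentAttentionState:
    """Constant-size state for one query (hypothetical names, finite scores assumed)."""

    def __init__(self, dim):
        self.max_score = -math.inf    # running maximum of the scores seen so far
        self.numerator = [0.0] * dim  # sum_i exp(s_i - max_score) * v_i
        self.denominator = 0.0        # sum_i exp(s_i - max_score)

    def update(self, score, value):
        """Fold in one (score, value) pair and return the attention output so far."""
        new_max = max(self.max_score, score)
        old_scale = math.exp(self.max_score - new_max)  # rescales the old sums
        new_scale = math.exp(score - new_max)
        self.numerator = [old_scale * a + new_scale * v
                          for a, v in zip(self.numerator, value)]
        self.denominator = old_scale * self.denominator + new_scale
        self.max_score = new_max
        return [a / self.denominator for a in self.numerator]
```

Feeding tokens one at a time through `update` should reproduce `softmax_attention` on every prefix, which is the sense in which attention behaves like an RNN whose final hidden state is the usual attention output.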

Paper Structure (30 sections, 14 equations, 5 figures, 5 tables, 1 algorithm)

This paper contains 30 sections, 14 equations, 5 figures, 5 tables, 1 algorithm.

Introduction
Background
Recurrent Neural Networks
Attention
Methodology
Attention as a (many-to-one) RNN
Attention as a (many-to-many) RNN
Aaren: Attention as a Recurrent Neural Network
Experiments
Reinforcement Learning
Event Forecasting
Time Series Forecasting
Time Series Classification
Analyses
Related Work
...and 15 more sections

Figures (5)

Figure 1: Attention as a many-to-one RNN. The query tokens are the initial hidden states of the RNNs. (a) The conventional method of computing attention only computes its final output. As such, it can be viewed as a method of computing attention's many-to-one RNN output. (b) Transformer's self-attention (Vaswani et al., 2017) uses the input tokens as the initial hidden states. (c) Perceiver's cross-attention (Jaegle et al., 2021) uses input-dependent latents as the initial hidden states.
Figure 2: Attention's RNN Cell.
Figure 3: Attention as a many-to-many RNN.
Figure 5: Computational Resources Plots comparing Aarens and Transformers (using KV-caching) when processing a sequence of tokens. (Left) Memory Usage Comparison. (Right) Cumulative Time Comparison.
Figure: Parallel Prefix Scan (Hillis and Steele's 1986 variation)
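As a rough illustration of the many-to-many computation, the sketch below runs a Hillis and Steele style inclusive scan over (max score, normaliser, weighted sum) triples, so that every prefix of the sequence gets its own attention output. It is a sequential simulation of the parallel steps and not the paper's code; it assumes precomputed, finite scores for a single query, and `combine`, `hillis_steele_scan`, and `prefix_attention` are hypothetical names.

```python
import math


def combine(a, b):
    """Associative operator merging two (max, normaliser, weighted-sum) triples."""
    m_a, u_a, w_a = a
    m_b, u_b, w_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    u = u_a * scale_a + u_b * scale_b
    w = [x * scale_a + y * scale_b for x, y in zip(w_a, w_b)]
    return (m, u, w)


def hillis_steele_scan(items, op):
    """Inclusive scan in ceil(log2(n)) rounds; each round could run in parallel."""
    out = list(items)
    step = 1
    while step < len(out):
        prev = out
        # Element i absorbs the partial result that ends `step` positions earlier.
        out = [prev[i] if i < step else op(prev[i - step], prev[i])
               for i in range(len(prev))]
        step *= 2
    return out


def prefix_attention(scores, values):
    """Attention output for every prefix 1..k of the (score, value) sequence."""
    leaves = [(s, 1.0, list(v)) for s, v in zip(scores, values)]
    scanned = hillis_steele_scan(leaves, combine)
    return [[x / u for x in w] for (_, u, w) in scanned]
```

Because `combine` is associative, the same triples could also be folded left to right one token at a time, which is what allows parallel training and constant-memory streaming updates to share one formulation.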
