Echo State Transformer: Attention Over Finite Memories

Yannis Bendi-Ouis, Xavier Hinaut

TL;DR

The paper introduces Echo State Transformer (EST), a hybrid architecture that replaces attention over growing sequences with attention over a fixed set of reservoir-based memory units, achieving linear-time scalability and persistent temporal representations. By integrating a Working Memory module with an adaptive leak-rate mechanism, EST leverages Reservoir Computing dynamics alongside Transformer attention, trained end-to-end on the readout. On the Time Series Library benchmark, EST achieves state-of-the-art results in classification and anomaly detection, while maintaining competitive short-term forecasting and highlighting limitations in long-horizon reconstruction due to its memory design and BPTT training. The work provides a practical, memory-efficient alternative to pure Transformer models for time-series tasks that require robust event detection and memory of rare patterns, and it outlines directions for removing recurrence to enable full sequence parallelization.

Abstract

While Large Language Models and their underlying Transformer architecture are remarkably efficient, they do not reflect how our brain processes and learns a diversity of cognitive tasks such as language and working memory. Furthermore, sequential data processing with Transformers encounters a fundamental barrier: quadratic complexity growth with sequence length. Motivated by these limitations, our ambition is to create more efficient models that are less reliant on intensive computations. We introduce Echo State Transformers (EST), a hybrid architecture that elegantly resolves this challenge while demonstrating exceptional performance in classification and detection tasks. EST integrates the Transformer attention mechanisms with principles from Reservoir Computing to create a fixed-size window distributed memory system. Drawing inspiration from Echo State Networks, the most prominent instance of the Reservoir Computing paradigm, our approach leverages reservoirs (random recurrent networks) as a lightweight and efficient memory. Our architecture integrates a new module called "Working Memory" based on several reservoirs working in parallel. These reservoirs work as independent working memory units with distinct internal dynamics. A novelty here is that the classical reservoir hyperparameters, which control the dynamics, are now trained. Thus, the EST dynamically adapts the reservoir memory/non-linearity trade-off. Thanks to these working memory units, EST achieves constant computational complexity at each processing step, effectively breaking the quadratic scaling problem of standard Transformers. We evaluate ESTs on a recent challenging time-series benchmark: the Time Series Library, which comprises 69 tasks across five categories. Results show that ESTs rank first overall in two of five categories, outperforming strong state-of-the-art baselines on classification and anomaly detection tasks, while remaining competitive on short-term forecasting.
These results position ESTs as a compelling alternative for time-series classification and anomaly detection, and a practical complement to transformer-style models in applications that prioritize robust representations and sensitive event detection.
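
As background for the reservoir-based memory described in the abstract, the following is a minimal sketch of a classical leaky-integrator Echo State Network, not the authors' EST implementation. The function names (`make_reservoir`, `run_reservoir`, `fit_readout`), the reservoir size, the spectral radius, and the fixed leak rate are all illustrative assumptions; in EST these dynamics hyperparameters are trained rather than fixed. The point it illustrates is that the state has a fixed size, so the cost of each update does not depend on how many steps have already been processed.

```python
import numpy as np

def make_reservoir(n_in, n_units, spectral_radius=0.9, seed=0):
    """Draw fixed random input and recurrent weights (never trained here)."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-1.0, 1.0, size=(n_units, n_in))
    w = rng.normal(size=(n_units, n_units))
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def run_reservoir(w_in, w, inputs, leak_rate=0.3):
    """Leaky update: x_t = (1 - a) * x_{t-1} + a * tanh(W_in u_t + W x_{t-1})."""
    state = np.zeros(w.shape[0])
    states = []
    for u in inputs:  # each step touches only the fixed-size state
        pre = np.tanh(w_in @ u + w @ state)
        state = (1.0 - leak_rate) * state + leak_rate * pre
        states.append(state)
    return np.stack(states)

def fit_readout(states, targets, ridge=1e-6):
    """Closed-form ridge regression for the readout weights W_out."""
    gram = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(gram, states.T @ targets).T

# Toy usage (assumed task): predict the next sample of a sine wave.
t = np.linspace(0, 20 * np.pi, 2000)
signal = np.sin(t)[:, None]
w_in, w = make_reservoir(n_in=1, n_units=50)
states = run_reservoir(w_in, w, signal[:-1])
w_out = fit_readout(states[100:], signal[101:])  # skip a 100-step warm-up
pred = states[100:] @ w_out.T
mse = float(np.mean((pred - signal[101:]) ** 2))
```

The leak rate sets how quickly a unit forgets: a small value keeps a long, smooth memory, while a value near 1 makes the state track the most recent input.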

Paper Structure

This paper contains 32 sections, 2 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Comparison of standard Transformer (Decoder-Only) and Echo State Transformer architecture. We add a "Working Memory" block to the standard Transformer architecture and apply attention on it with the "Previous State Attention" block. This block computes Keys and Values from the previous state, and Queries from the input at time $t$.
  • Figure 2: Computation of attention in Transformers, illustrated with matrix multiplication schematics. On the left, the computation of Queries, Keys and Values. On the right, the application of the attention formula to the previously computed Queries, Keys and Values.
  • Figure 3: An Echo State Network is composed of three layers: $W_{in}$ processes the input, $W$ computes the state update and $W_{out}$ computes the output. Only $W_{out}$ is trained, via linear regression.
  • Figure 4: This figure displays the mechanism of the Previous State Attention block. It produces Keys and Values from all memory units ($S_{out}$) and Queries from the embedding at time $t$ ($emb_t$). As in the Transformer's Multi-Head Attention block, we compute several distinct attention products, one per memory unit, which allows each unit to compute its own input vector.
  • Figure 5: This figure displays the mechanism behind the Working Memory block, and more particularly the adaptive leak rate. Each memory unit computes a score from its input vector. A softmax is then applied over all of these scores to compute the leak rate of each unit.
  • ...and 6 more figures
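
The captions of Figures 4 and 5 can be read as one recurrent step, sketched below under stated assumptions. This is not the authors' code: the parameter names (`wq`, `wk`, `wv`, `w_score`, `w_in`, `w`, `w_out`), the tensor shapes, the projection of each unit's state to the model width, the linear scoring function, and the non-zero initial state are all hypothetical choices made for illustration. What follows the captions is the overall flow: Queries come from `emb_t`, Keys and Values come from the memory units' previous outputs, each unit gets its own attention product, and a softmax over per-unit scores yields the leak rates.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def init_params(n_units, d_model, d_res, seed=0):
    """Random parameters with assumed shapes (illustrative only)."""
    rng = np.random.default_rng(seed)
    g = lambda *shape: rng.normal(scale=0.1, size=shape)
    return {
        "wq": g(n_units, d_model, d_model),  # per-unit query projection
        "wk": g(n_units, d_model, d_model),  # per-unit key projection
        "wv": g(n_units, d_model, d_model),  # per-unit value projection
        "w_score": g(n_units, d_model),      # per-unit leak score (assumed linear)
        "w_in": g(n_units, d_res, d_model),  # reservoir input weights
        "w": g(n_units, d_res, d_res),       # reservoir recurrent weights
        "w_out": g(n_units, d_res, d_model), # state -> model-width output
    }

def est_step(params, emb_t, states):
    """One step: previous-state attention, then a leaky reservoir update."""
    d_model = emb_t.shape[0]
    # S_out: each unit's previous state projected to model width, (M, D).
    s_out = np.einsum("mr,mrd->md", states, params["w_out"])
    # One attention product per unit m, attending over all units j.
    q = np.einsum("d,mde->me", emb_t, params["wq"])    # (M, D)
    k = np.einsum("jd,mde->mje", s_out, params["wk"])  # (M, M, D)
    v = np.einsum("jd,mde->mje", s_out, params["wv"])  # (M, M, D)
    attn = softmax(np.einsum("me,mje->mj", q, k) / np.sqrt(d_model))
    unit_inputs = np.einsum("mj,mje->me", attn, v)     # (M, D)
    # Adaptive leak rate: one score per unit, softmax across units.
    scores = np.einsum("md,md->m", unit_inputs, params["w_score"])
    leak = softmax(scores)                              # (M,), sums to 1
    pre = np.tanh(
        np.einsum("mrd,md->mr", params["w_in"], unit_inputs)
        + np.einsum("mrs,ms->mr", params["w"], states)
    )
    new_states = (1.0 - leak[:, None]) * states + leak[:, None] * pre
    return new_states, leak

# Toy usage: the memory stays (M, R) however long the sequence is.
rng = np.random.default_rng(1)
M, D, R = 4, 8, 16
params = init_params(M, D, R)
states = rng.normal(scale=0.5, size=(M, R))  # assumed non-zero start
for emb_t in rng.normal(size=(100, D)):
    states, leak = est_step(params, emb_t, states)
```

Because attention runs over the `M` memory units rather than over all past tokens, the work per step in this sketch depends only on `M`, `D` and `R`, not on the sequence position, which matches the constant per-step complexity claimed in the abstract.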