
Deconstructing Recurrence, Attention, and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems

Hunter S. Heidenreich, Pantelis R. Vlachas, Petros Koumoutsakos

TL;DR

This study decomposes the key architectural components of the most powerful neural architectures, namely gating and recurrence in RNNs and attention mechanisms in transformers. It finds that neural gating and attention improve the performance of all standard RNNs in most tasks, while adding a notion of recurrence to transformers is detrimental.

Abstract

Machine learning architectures, including transformers and recurrent neural networks (RNNs), have revolutionized forecasting in applications ranging from text processing to extreme weather. Notably, advanced network architectures tuned for applications such as natural language processing are transferable to other tasks, such as spatiotemporal forecasting. However, there is a scarcity of ablation studies that illustrate the key components enabling this forecasting accuracy. The absence of such studies, although explainable by the associated computational cost, reinforces the belief that these models ought to be considered black boxes. In this work, we decompose the key architectural components of the most powerful neural architectures, namely gating and recurrence in RNNs, and attention mechanisms in transformers. Then, we synthesize and build novel hybrid architectures from the standard blocks, performing ablation studies to identify which mechanisms are effective for each task. The importance of treating these components as hyper-parameters that can augment the standard architectures is demonstrated on various forecasting datasets, ranging from the spatiotemporal chaotic dynamics of the multiscale Lorenz 96 system and the Kuramoto-Sivashinsky equation to standard real-world time-series benchmarks. A key finding is that neural gating and attention improve the performance of all standard RNNs in most tasks, while the addition of a notion of recurrence in transformers is detrimental. Furthermore, our study reveals that a novel, sparsely used architecture that integrates Recurrent Highway Networks with neural gating and attention mechanisms emerges as the best-performing architecture in high-dimensional spatiotemporal forecasting of dynamical systems.

Paper Structure (40 sections, 22 equations, 17 figures, 13 tables)


Figures (17)

  • Figure 1: Information flows through recurrent cells. Layers are denoted with squares and element-wise operations with circles. $\sigma$ is the sigmoid function and $\tau$ is the hyperbolic tangent. For the RHN, we depict an $L=2$ cell where the unit highlighted in red is repeated twice. Concatenation is denoted by the intersection of two directed lines, and copying is denoted by their forking.
  • Figure 2: Visualization of a pre-layer normalization (left) or post-layer normalization (right) Transformer block.
  • Figure 3: Visualized information flows through neural gates. Gate types increase in complexity from left to right. Orange circles denote element-wise operations without learnable parameters; yellow rectangles denote full layers.
  • Figure 4: Scaled dot-product attention as a gate. $\mathbf{x}_i$ is a single element in a sequence $\mathbf{X} \in \mathbb{R}^{T_1 \times d}$ gating sequence $\mathbf{Y} \in \mathbb{R}^{T_2 \times d}$. A full attention operation consists of $T_1$ circuits executed in parallel, one for each element $\mathbf{x}_i \in \mathbf{X}$. ${\rm SM}$ denotes Softmax.
  • Figure 5: The evolution of the average NRMSE for the top models of each architecture on Multiscale Lorenz-96 with $F = 10$. NRMSE is averaged with respect to the initial conditions in the test split. Though many models are able to achieve comparable performance, the RHN exhibits the least error.
  • ...and 12 more figures
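
The gate described in the Figure 4 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `attention_gate`, `attention`) and the toy inputs are hypothetical, and the learned query, key, and value projections that a transformer would normally apply are omitted, so $\mathbf{Y}$ serves directly as both keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats (SM in the caption)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_gate(x_i, Y):
    """Gate the sequence Y (T2 rows of width d) with one element x_i (width d).

    Assumption: no learned projections; Y acts as both keys and values.
    """
    d = len(x_i)
    # Scaled dot-product score between x_i and every row of Y.
    scores = [sum(a * b for a, b in zip(x_i, y)) / math.sqrt(d) for y in Y]
    # Softmax turns the scores into non-negative weights that sum to 1.
    weights = softmax(scores)
    # The gate output is the weighted sum of the rows of Y.
    return [sum(w * y[k] for w, y in zip(weights, Y)) for k in range(d)]


def attention(X, Y):
    """Full operation: one independent gate per element of X (T1 in total)."""
    return [attention_gate(x_i, Y) for x_i in X]


# Toy inputs (hypothetical): T1 = 3, T2 = 2, d = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[2.0, 0.0], [0.0, 2.0]]
out = attention(X, Y)
```

Each of the $T_1$ gates depends only on its own $\mathbf{x}_i$ and on $\mathbf{Y}$, which is why the caption can describe them as circuits executed in parallel. The list comprehension in `attention` runs them sequentially only for readability.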