Were RNNs All We Needed?

Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, Hossein Hajimirsadeghi

TL;DR

The paper shows that removing the dependence of the LSTM and GRU gates on the previous hidden state lets these traditional RNNs be trained in parallel with a parallel scan. The resulting minimal variants, minLSTM and minGRU, use fewer parameters and remain competitive. In experiments on selective copying, long-range benchmarks, reinforcement learning, and Shakespeare language modelling, the minimal RNNs achieve results comparable to Transformers and state-of-the-art recurrent models while offering substantial training speedups. The work provides PyTorch implementations and argues for revisiting simple, scalable recurrent architectures in the era of large Transformers. It also discusses hardware limitations and suggests that, with adequate resources, these minimal RNNs could perform well in larger-scale settings, challenging the assumption that Transformers are universally superior for sequence modelling.

Abstract

The introduction of Transformers in 2017 reshaped the landscape of deep learning. Originally proposed for sequence modelling, Transformers have since achieved widespread success across various domains. However, the scalability limitations of Transformers - particularly with respect to sequence length - have sparked renewed interest in novel recurrent models that are parallelizable during training, offer comparable performance, and scale more effectively. In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs), which dominated the field for two decades before the rise of Transformers. Specifically, we examine LSTMs (1997) and GRUs (2014). We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that (1) use fewer parameters than their traditional counterparts, (2) are fully parallelizable during training, and (3) achieve surprisingly competitive performance on a range of tasks, rivalling recent models including Transformers.
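
To make the parallelization claim concrete, here is a minimal sketch of why a gate that depends only on the current input allows a scan. It assumes the minGRU update h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, with both z_t and the candidate h~_t computed from x_t alone, and it uses a scalar hidden state for readability. The function names and weights are hypothetical, and the plain-Python Hillis-Steele scan stands in for whatever scan implementation the authors' PyTorch code uses; this is an illustration, not their implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def min_gru_coeffs(xs, w_z, b_z, w_h, b_h):
    """Return (a_t, b_t) so that h_t = a_t * h_{t-1} + b_t.

    Scalar toy version: gate and candidate depend only on x_t,
    so every pair can be computed without knowing any hidden state.
    """
    coeffs = []
    for x in xs:
        z = sigmoid(w_z * x + b_z)   # update gate z_t
        h_tilde = w_h * x + b_h      # candidate state
        coeffs.append((1.0 - z, z * h_tilde))
    return coeffs


def sequential(coeffs, h0):
    """Reference: step through the recurrence one token at a time."""
    hs, h = [], h0
    for a, b in coeffs:
        h = a * h + b
        hs.append(h)
    return hs


def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan(coeffs, h0):
    """Hillis-Steele inclusive scan over the associative `combine`.

    Each pass touches all positions independently, so on parallel
    hardware it needs about log2(T) passes instead of T steps.
    """
    acc = list(coeffs)
    step = 1
    while step < len(acc):
        acc = [
            acc[i] if i < step else combine(acc[i - step], acc[i])
            for i in range(len(acc))
        ]
        step *= 2
    # acc[t] now maps h0 directly to h_t.
    return [a * h0 + b for a, b in acc]
```

Calling `scan(min_gru_coeffs(xs, ...), h0)` should agree with `sequential(...)` up to floating-point error. The same trick does not apply to a standard GRU or LSTM, because their gates read h_{t-1}, so the coefficients at step t cannot be computed before step t-1 has finished.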

Paper Structure

This paper contains 45 sections, 19 equations, 7 figures, 6 tables, 8 algorithms.

Figures (7)

  • Figure 1: Training runtime (left), speedup (middle), and memory footprint (right) on a T4 GPU for a batch size of $64$. In the training runtime plot (left), minGRU, minLSTM, and Mamba lines overlap. These methods are approximately the same in training runtime.
  • Figure 2: Language Modelling results on the Shakespeare dataset. Minimal versions of decade-old RNNs (LSTMs and GRUs) performed comparably to Mamba and Transformers. Transformers required $\sim 2.5\times$ more training steps to achieve comparable performance and eventually overfit.
  • Figure 3: Runtime Comparison of Inference with Context Tokens: Parallelizable RNNs (minLSTM, minGRU, and Mamba) vs. Traditional RNNs (LSTM and GRU). As sequential models, LSTM and GRU exhibit significantly slower inference times when processing an increasing number of context tokens, compared to the parallelizable models minLSTM, minGRU, and Mamba.
  • Figure 4: Runtime Comparison of Inference: Minimal RNNs (minLSTM and minGRU) vs. Traditional Counterparts (LSTM and GRU). As simplified versions of LSTM and GRU, minLSTM and minGRU generally exhibit faster inference times, particularly with larger batch sizes, as shown in the plots.
  • Figure 5: Impact of Forget Gate Bias Initialization on Training Efficiency. The plot illustrates how increasing the bias of the forget gate in minLSTM enhances training efficiency by promoting earlier retention of information, leading to faster convergence and a more stable training process.
  • ...and 2 more figures