Fast Training of Recurrent Neural Networks with Stationary State Feedbacks

Paul Caillon, Erwan Fagnou, Alexandre Allauzen

TL;DR

This work addresses the training bottleneck that backpropagation through time (BPTT) imposes on recurrent neural networks by introducing Diagonal State Feedbacks (DSF), a fixed diagonal feedback mechanism inspired by state-space models that carries gradients backward in time. By replacing the time-varying Jacobians with a stationary diagonal matrix, gradients are computed by a linear time-invariant backward process, which can be realized efficiently as a convolution in the Fourier domain. The authors provide a computational analysis showing that DSF reduces per-step complexity to $O(d)$ and can be parallelized with prefix sums or FFTs to reach $O(d T \log T)$ total complexity, while retaining competitive performance on language modeling benchmarks (PTB and WikiText-103) compared to full BPTT and fully truncated BPTT (FT-BPTT). Empirically, DSF consistently outperforms FT-BPTT, scales well with model size and depth, and remains competitive with SSM-based approaches under similar parameter budgets, suggesting a practical, scalable route for training large recurrent models in resource-constrained settings.

Abstract

Recurrent neural networks (RNNs) have recently demonstrated strong performance and faster inference than Transformers at comparable parameter budgets. However, the recursive gradient computation of backpropagation through time (BPTT) remains the major computational bottleneck. In this work, we propose a novel method that replaces BPTT with a fixed gradient feedback mechanism, yielding an efficient approximation of the exact gradient propagation based on the assumption of time stationarity. Our approach leverages state-space model (SSM) principles to define a structured feedback matrix that directly propagates gradients from future time steps. This formulation bypasses the need for recursive gradient backpropagation, significantly reducing training overhead while preserving the network's ability to capture long-term dependencies. Experiments on language modeling benchmarks show competitive perplexity scores at significantly reduced training cost. These promising results suggest that designing a feedback method like an SSM can fully exploit the efficiency advantages of RNNs for many practical applications.
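
To make the idea of a stationary diagonal feedback concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the backward pass takes the form $\delta_t = g_t + a \odot \delta_{t+1}$, where $g_t$ is the instantaneous loss gradient with respect to the hidden state at step $t$ and $a$ is a fixed vector standing in for the time-varying Jacobian; this recurrence, the function names, and the array layout are illustrative assumptions not taken from the paper. Under that assumption the backward pass is linear and time-invariant, so it can be evaluated either step by step or as one convolution per hidden unit via FFT.

```python
import numpy as np


def dsf_backward_sequential(g, a):
    """Reference recurrence (assumed form): delta[t] = g[t] + a * delta[t + 1].

    g: float array of shape (T, d), instantaneous gradients w.r.t. hidden states.
    a: float array of shape (d,), fixed diagonal feedback (hypothetical name).
    """
    T, d = g.shape
    delta = np.zeros_like(g)
    carry = np.zeros(d)
    # Walk backward in time; each step costs O(d) because the feedback is diagonal.
    for t in range(T - 1, -1, -1):
        carry = g[t] + a * carry
        delta[t] = carry
    return delta


def dsf_backward_fft(g, a):
    """Same quantity computed as a convolution: delta[t] = sum_k a**k * g[t + k]."""
    T, _ = g.shape
    # Convolution kernel a**k for k = 0..T-1, one column per hidden unit.
    kernel = a[None, :] ** np.arange(T)[:, None]
    # Reverse time so the anti-causal sum becomes an ordinary causal convolution.
    g_rev = g[::-1]
    # Zero-pad to 2T so the circular FFT convolution equals the linear one.
    n = 2 * T
    spectrum = np.fft.rfft(g_rev, n=n, axis=0) * np.fft.rfft(kernel, n=n, axis=0)
    delta_rev = np.fft.irfft(spectrum, n=n, axis=0)[:T]
    return delta_rev[::-1]
```

The sequential version costs $O(d)$ per step but cannot be parallelized over time, whereas the FFT version costs $O(d T \log T)$ in total, matching the complexity quoted in the TL;DR. Both functions should agree up to floating-point error for any feedback vector with entries of magnitude below one.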

Paper Structure

This paper contains 35 sections, 11 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison of Training and Validation Perplexity across Different Model Configurations on Penn Treebank. Each subfigure represents a different network configuration, varying in the number of layers and hidden size.
  • Figure 2: Comparison of Validation Perplexity across Different Hidden Dimension Sizes on WikiText-103. The dimension size varies from 128 to 1024.
  • Figure 3: Comparison of Best Validation Perplexity across Different RNN Types on WikiText-103. The models include standard RNNs, GRUs, and LSTMs of 3 layers with hidden size 512, trained with BPTT, DSF, and FT-BPTT.