Fast Training of Recurrent Neural Networks with Stationary State Feedbacks
Paul Caillon, Erwan Fagnou, Alexandre Allauzen
TL;DR
This work addresses the training bottleneck that backpropagation through time (BPTT) imposes on recurrent neural networks by introducing Diagonal State Feedbacks (DSF), a fixed diagonal feedback mechanism, inspired by state-space models, that propagates gradients backward in time. Replacing the time-varying Jacobians with a stationary diagonal matrix turns gradient computation into a linear time-invariant backward process, which can be evaluated efficiently as a convolution in the Fourier domain. The authors' computational analysis shows that DSF reduces the per-step cost to $O(d)$ and can be parallelized with prefix sums or FFTs to reach a total complexity of $O(d T \log T)$. On language modeling benchmarks (PTB and WikiText-103), DSF achieves performance competitive with full BPTT, consistently outperforms fully truncated BPTT (FT-BPTT), scales well with model size and depth, and remains competitive with SSM-based approaches under similar parameter budgets. These results suggest a practical, scalable route for training large recurrent models in resource-constrained settings.
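To make the linear time-invariant backward process concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the backward recurrence takes the form `delta[t] = g[t] + a * delta[t+1]`, where `g[t]` is the local loss gradient with respect to the hidden state at step `t` and `a` is the fixed diagonal feedback stored as a vector of size `d`. The names `g`, `a`, `delta`, the zero boundary condition past the end of the sequence, and both function names are illustrative assumptions. The sequential version costs $O(d)$ per step; the FFT version computes the same quantity as a convolution in $O(d T \log T)$.

```python
import numpy as np


def dsf_backward_sequential(g, a):
    """Reference recurrence: delta[t] = g[t] + a * delta[t+1].

    g: (T, d) local gradients; a: (d,) assumed diagonal feedback.
    The gradient flowing in from beyond the last step is assumed to be zero.
    """
    T, d = g.shape
    delta = np.zeros_like(g)
    carry = np.zeros(d)
    for t in range(T - 1, -1, -1):
        carry = g[t] + a * carry  # elementwise product: O(d) per step
        delta[t] = carry
    return delta


def dsf_backward_fft(g, a):
    """Same result in closed form: delta[t] = sum_{k>=0} a**k * g[t+k]."""
    T, _ = g.shape
    # Geometric kernel, one column per hidden dimension: kernel[k] = a**k.
    kernel = a[None, :] ** np.arange(T)[:, None]
    n = 2 * T  # zero-padding so the circular convolution equals the linear one
    # Reversing time turns the anti-causal sum into a causal convolution.
    G = np.fft.rfft(g[::-1], n=n, axis=0)
    K = np.fft.rfft(kernel, n=n, axis=0)
    out = np.fft.irfft(G * K, n=n, axis=0)[:T]
    return out[::-1]
```

Because the feedback is the same at every time step, the kernel depends only on the lag `k`, which is what allows the whole backward pass to be written as a single convolution. With time-varying Jacobians, as in exact BPTT, no such fixed kernel exists.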
Abstract
Recurrent neural networks (RNNs) have recently demonstrated strong performance and faster inference than Transformers at comparable parameter budgets. However, the recursive gradient computation of the backpropagation through time (BPTT) algorithm remains the major computational bottleneck. In this work, we propose a novel method that replaces BPTT with a fixed gradient feedback mechanism, yielding an efficient approximation of the exact gradient propagation based on the assumption of time stationarity. Our approach leverages state-space model (SSM) principles to define a structured feedback matrix that directly propagates gradients from future time steps. This formulation bypasses the need for recursive gradient backpropagation, significantly reducing training overhead while preserving the network's ability to capture long-term dependencies. Experiments on language modeling benchmarks show competitive perplexity scores at significantly reduced training cost. These promising results suggest that designing the feedback method like an SSM can fully exploit the efficiency advantages of RNNs for many practical applications.
