Towards Scalable and Stable Parallelization of Nonlinear RNNs
Xavier Gonzalez, Andrew Warrington, Jimmy T. H. Smith, Scott W. Linderman
TL;DR
The paper tackles scalable parallel evaluation of nonlinear RNNs by recasting evaluation as a fixed-point problem: find the state trajectory $\mathbf{s}_{1:T}$ whose residual vanishes, $\mathbf{r}(\mathbf{s}_{1:T}) = \mathbf{0}$, and solve it with Newton-type methods. It introduces quasi-DEER, which uses diagonal Jacobian approximations to avoid costs that scale cubically in the state dimension $D$, and ELK, which stabilizes the iterations with a trust region implemented via Kalman smoothing; quasi-ELK combines both ideas. The authors prove global convergence of DEER, extend the convergence guarantee to quasi-DEER, and report substantial speedups and memory savings in experiments on evaluation and training tasks, including autoregressive GRUs and chaotic dynamics. ELK is a robust alternative in regimes where DEER struggles, and quasi-ELK often gives the best wall-clock performance when stability is a constraint. The paper also offers practical guidance for choosing among the methods depending on the dynamics and hardware constraints.
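To make the fixed-point view concrete, here is a minimal sketch of a quasi-DEER-style iteration in plain Python. It is not the authors' implementation: the toy RNN $\mathbf{s}_t = \tanh(W \mathbf{s}_{t-1} + \mathbf{u}_t)$, the zero initial guess, and the helper names `step`, `sequential_eval`, and `quasi_deer` are all assumptions made for illustration. Each iteration linearizes the dynamics around the current guess, keeps only the diagonal of the Jacobian, and solves the resulting linear recurrence. The sketch solves that recurrence with an ordinary loop for readability; the methods summarized above would use a parallel scan at that point.

```python
import math

def step(W, s_prev, u):
    """One step of a toy RNN (an assumption): s_t = tanh(W s_{t-1} + u_t)."""
    return [math.tanh(sum(w * s for w, s in zip(row, s_prev)) + u_k)
            for row, u_k in zip(W, u)]

def sequential_eval(W, s0, inputs):
    """Reference: evaluate the RNN one step at a time."""
    states, s = [], s0
    for u in inputs:
        s = step(W, s, u)
        states.append(s)
    return states

def quasi_deer(W, s0, inputs, num_iters):
    """Quasi-Newton fixed-point iteration with a diagonal Jacobian."""
    T, D = len(inputs), len(s0)
    states = [[0.0] * D for _ in range(T)]  # initial guess for s_1..s_T
    for _ in range(num_iters):
        # Predecessor states under the current guess: s_0, s_1, ..., s_{T-1}.
        prev_old = [s0] + states[:-1]
        # These two quantities are independent across t, so they could be
        # computed for all time steps in parallel.
        f_vals = [step(W, p, u) for p, u in zip(prev_old, inputs)]
        jac_diag = [[(1.0 - f_k * f_k) * W[k][k] for k, f_k in enumerate(f)]
                    for f in f_vals]
        # Linear recurrence for the update:
        #   s_t <- f_t + diag(J_t) * (s_{t-1}^{new} - s_{t-1}^{old}).
        # Solved sequentially here; this is the step a parallel scan replaces.
        new_states, prev_new = [], s0
        for t in range(T):
            s_t = [f_vals[t][k] + jac_diag[t][k] * (prev_new[k] - prev_old[t][k])
                   for k in range(D)]
            new_states.append(s_t)
            prev_new = s_t
        states = new_states
    return states

# Small usage example with made-up weights and inputs.
W = [[0.5, -0.3], [0.2, 0.4]]
s0 = [0.1, -0.2]
inputs = [[0.1 * math.sin(t), 0.1 * math.cos(t)] for t in range(8)]

reference = sequential_eval(W, s0, inputs)
approx = quasi_deer(W, s0, inputs, num_iters=len(inputs))
max_error = max(abs(a - b)
                for r, q in zip(reference, approx)
                for a, b in zip(r, q))
print("max abs difference after T iterations:", max_error)
```

Because $\mathbf{s}_0$ is fixed, every iteration makes at least one more leading state exact, so in this sketch $T$ iterations reproduce the sequential trajectory. The practical appeal is that far fewer iterations are often enough.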
Abstract
Transformers and linear state space models can be evaluated in parallel on modern hardware, but evaluating nonlinear RNNs appears to be an inherently sequential problem. Recently, however, Lim et al. '24 developed an approach called DEER, which evaluates nonlinear RNNs in parallel by posing the states as the solution to a fixed-point problem. They derived a parallel form of Newton's method to solve the fixed-point problem and achieved significant speedups over sequential evaluation. However, the computational complexity of DEER is cubic in the state size, and the algorithm can suffer from numerical instability. We address these limitations with two novel contributions. To reduce the computational complexity, we apply quasi-Newton approximations and show that they converge comparably to Newton's method, use less memory, and are faster. To stabilize DEER, we leverage a connection between the Levenberg-Marquardt algorithm and Kalman smoothing, yielding a method we call ELK. This connection allows us to stabilize Newton's method while using efficient parallelized Kalman smoothing algorithms to retain performance. Through several experiments, we show that these innovations allow for parallel evaluation of nonlinear RNNs at larger scales and with greater stability.
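A parallel form of Newton's method is possible because each Newton step reduces to a linear recurrence $s_t = a_t s_{t-1} + b_t$, and affine maps compose associatively. The following sketch illustrates this for the scalar case (equivalently, one coordinate of a diagonal recurrence) in plain Python. The names `combine`, `scan_affine`, and `solve_recurrence` are hypothetical, and the doubling scheme is a textbook Hillis-Steele scan chosen for illustration, not necessarily the scan used by the authors. Each round of the scan touches every time step independently, so on parallel hardware the recurrence needs about $\log_2 T$ rounds instead of $T$ sequential steps.

```python
def combine(earlier, later):
    """Compose two affine maps x -> a*x + b, applying `earlier` first."""
    a1, b1 = earlier
    a2, b2 = later
    return (a2 * a1, a2 * b1 + b2)

def scan_affine(elems):
    """Inclusive scan under `combine` using recursive doubling.

    Every entry within a round is computed independently of the others,
    which is what makes the scheme parallelizable.
    """
    out = list(elems)
    offset = 1
    while offset < len(out):
        out = [out[t] if t < offset else combine(out[t - offset], out[t])
               for t in range(len(out))]
        offset *= 2
    return out

def solve_recurrence(a, b, s0):
    """Return s_1..s_T for s_t = a_t * s_{t-1} + b_t via the scan."""
    prefixes = scan_affine(list(zip(a, b)))
    # prefixes[t] is the composed map taking s_0 directly to s_{t+1}.
    return [A * s0 + B for A, B in prefixes]

# Usage example with made-up coefficients, checked against a plain loop.
a = [0.9, -0.5, 0.3, 0.8, -0.2, 0.6, 0.1]
b = [0.1, 0.2, -0.3, 0.0, 0.5, -0.1, 0.4]
s0 = 1.0

scanned = solve_recurrence(a, b, s0)
looped, s = [], s0
for a_t, b_t in zip(a, b):
    s = a_t * s + b_t
    looped.append(s)
print("max abs difference:", max(abs(x - y) for x, y in zip(scanned, looped)))
```

In JAX, `jax.lax.associative_scan` provides a primitive of this kind. For full $D \times D$ Jacobians the same composition involves matrix products, which is where the cubic dependence on the state size comes from; diagonal approximations reduce it to elementwise products.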
