Towards Scalable and Stable Parallelization of Nonlinear RNNs

Xavier Gonzalez; Andrew Warrington; Jimmy T. H. Smith; Scott W. Linderman

TL;DR

The paper tackles scalable parallel evaluation of nonlinear RNNs by reframing evaluation as solving a fixed-point problem with residual $\mathbf{r}(\mathbf{s}_{1:T}) = 0$ and applying Newton-type methods. It introduces quasi-DEER (diagonal Jacobian approximations) to reduce costs that are cubic in the state dimension $D$, and ELK (a trust-region method computed via Kalman smoothing) to stabilize the iterations; quasi-ELK combines both ideas. The authors prove global convergence of DEER, extend the convergence guarantee to quasi-DEER, and demonstrate substantial speedups and memory savings in experiments across evaluation and training tasks, including autoregressive GRUs and chaotic dynamics. ELK provides a robust alternative in regimes where DEER struggles, while quasi-ELK often offers the best wall-clock performance under stability constraints. Overall, the work enables scalable, stable parallel evaluation of nonlinear RNNs and offers practical guidance for choosing among the methods depending on dynamics and hardware constraints.

Abstract

Transformers and linear state space models can be evaluated in parallel on modern hardware, but evaluating nonlinear RNNs appears to be an inherently sequential problem. Recently, however, Lim et al. '24 developed an approach called DEER, which evaluates nonlinear RNNs in parallel by posing the states as the solution to a fixed-point problem. They derived a parallel form of Newton's method to solve the fixed-point problem and achieved significant speedups over sequential evaluation. However, the computational complexity of DEER is cubic in the state size, and the algorithm can suffer from numerical instability. We address these limitations with two novel contributions. To reduce the computational complexity, we apply quasi-Newton approximations and show they converge comparably to Newton, use less memory, and are faster. To stabilize DEER, we leverage a connection between the Levenberg-Marquardt algorithm and Kalman smoothing, which we call ELK. This connection allows us to stabilize Newton's method while using efficient parallelized Kalman smoothing algorithms to retain performance. Through several experiments, we show that these innovations allow for parallel evaluation of nonlinear RNNs at larger scales and with greater stability.
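To make the quasi-Newton idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical 2-dimensional RNN `s_t = tanh(W s_{t-1} + u_t)`; the names `step`, `diag_jacobian`, `sequential`, and `quasi_deer`, as well as the weights and inputs, are illustrative only. Each iteration linearizes the dynamics around the current guess but keeps only the diagonal of the Jacobian, so the resulting linear recurrence is elementwise. The inner loop is written sequentially for clarity; it is this linear recurrence that a parallel implementation would evaluate with a scan.

```python
import math

def step(s, u, W):
    """One step of a toy 2-D nonlinear RNN: s_t = tanh(W s_{t-1} + u_t)."""
    return [math.tanh(W[i][0] * s[0] + W[i][1] * s[1] + u[i]) for i in range(2)]

def diag_jacobian(s, u, W):
    """Diagonal of d step / d s, i.e. (1 - tanh^2(pre_i)) * W[i][i]."""
    nxt = step(s, u, W)
    return [(1.0 - nxt[i] ** 2) * W[i][i] for i in range(2)]

def sequential(s0, inputs, W):
    """Reference rollout, one step at a time."""
    states, s = [], s0
    for u in inputs:
        s = step(s, u, W)
        states.append(s)
    return states

def quasi_deer(s0, inputs, W, num_iters):
    """Quasi-Newton iterations with a diagonal Jacobian approximation (sketch)."""
    T = len(inputs)
    guess = [[0.0, 0.0] for _ in range(T)]  # arbitrary initial guess
    for _ in range(num_iters):
        prev_old = [s0] + guess[:-1]  # previous-iterate states s_{t-1}
        f_vals = [step(p, u, W) for p, u in zip(prev_old, inputs)]
        jacs = [diag_jacobian(p, u, W) for p, u in zip(prev_old, inputs)]
        # Elementwise linear recurrence:
        #   s_t = f_t + J_t * (s_{t-1} - s_old_{t-1})
        new, prev_new = [], s0
        for t in range(T):
            s_t = [f_vals[t][i] + jacs[t][i] * (prev_new[i] - prev_old[t][i])
                   for i in range(2)]
            new.append(s_t)
            prev_new = s_t
        guess = new
    return guess

# Illustrative usage with made-up weights and inputs.
T = 20
W = [[0.5, -0.3], [0.2, 0.4]]
inputs = [[math.sin(0.3 * t), math.cos(0.2 * t)] for t in range(T)]
reference = sequential([0.0, 0.0], inputs, W)
estimate = quasi_deer([0.0, 0.0], inputs, W, num_iters=T)
max_err = max(abs(e[i] - r[i]) for e, r in zip(estimate, reference) for i in range(2))
print("max abs error vs. sequential rollout:", max_err)
```

With as many iterations as time steps, the result matches the sequential rollout, because each iteration fixes at least one more leading state regardless of which Jacobian approximation is used. Storing only the diagonal is what removes the dense $D \times D$ matrix products from each step.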

Paper Structure

This paper contains 51 sections, 3 theorems, 23 equations, 9 figures, 2 tables, and 1 algorithm.

Introduction
Problem Statement
Jacobian of the Residual
DEER: Newton's Method for Parallel Evaluation of Sequential Models
Derivation of DEER from Newton's Method
Global Convergence of DEER
Weaknesses of DEER
Scaling and Stabilizing Newton's Method for Parallel Evaluation
Quasi-DEER: Scaling DEER with Diagonal Jacobian Approximations
ELK: Stabilizing DEER with Trust Regions
Quasi-ELK: Scalability and Stability
Implementation Details
Limitations
Related Work
RNNs and Parallelism
...and 36 more sections

Key Result

Proposition 1

Undamped Newton's method will converge to the true solution, $\mathbf{s}^*_{1:T}$, of the fixed-point equation $\mathbf{r}(\mathbf{s}_{1:T}) = 0$ in at most $T$ Newton iterations, for any initial guess $\mathbf{s}^{(0)}_{1:T}$.
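The proposition can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's code: it uses a hypothetical scalar RNN `s_t = tanh(w s_{t-1} + u_t)` with `w = 0.9`, and the helper names (`f`, `df`, `residual`, `newton_step`, `iterations_to_converge`) are chosen for this example. One Newton step on the residual reduces to a linear recurrence, and after `k` steps the first `k` states agree with the sequential rollout, so at most `T` steps are needed even from a poor initial guess.

```python
import math

def f(s_prev, u, w=0.9):
    """Toy scalar RNN transition (assumed for illustration)."""
    return math.tanh(w * s_prev + u)

def df(s_prev, u, w=0.9):
    """Derivative of f with respect to s_prev."""
    return w * (1.0 - math.tanh(w * s_prev + u) ** 2)

def residual(states, s0, inputs):
    """r_t = s_t - f(s_{t-1}, u_t); all zeros exactly at the true trajectory."""
    prev = [s0] + states[:-1]
    return [s - f(p, u) for s, p, u in zip(states, prev, inputs)]

def newton_step(states, s0, inputs):
    """One undamped Newton step, written as a linear recurrence over t."""
    prev_old = [s0] + states[:-1]
    new, prev_new = [], s0
    for p_old, u in zip(prev_old, inputs):
        s_t = f(p_old, u) + df(p_old, u) * (prev_new - p_old)
        new.append(s_t)
        prev_new = s_t
    return new

def iterations_to_converge(init, s0, inputs, tol=1e-12):
    """Return (number of Newton steps used, final states), trying at most T steps."""
    states = list(init)
    for k in range(len(inputs) + 1):
        if max(abs(r) for r in residual(states, s0, inputs)) <= tol:
            return k, states
        states = newton_step(states, s0, inputs)
    return None, states

# Illustrative usage: a deliberately poor initial guess of all 5.0.
T = 25
inputs = [math.sin(0.7 * t) for t in range(T)]
k, states = iterations_to_converge([5.0] * T, 0.1, inputs)
print("Newton steps used:", k, "out of at most", T)
```

In practice far fewer than `T` steps are typically needed; the bound of `T` is the worst case guaranteed by the proposition.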

Figures (9)

Figure 1: Overview of the parallelizable methods we consider in this paper. We introduce diagonal approximations to improve complexity (quasi-DEER; see "Quasi-DEER: Scaling DEER with Diagonal Jacobian Approximations") and link to Kalman filtering and trust regions to improve stability (ELK; see "ELK: Stabilizing DEER with Trust Regions"). We combine these ideas in quasi-ELK, discussed alongside ELK.
Figure 2: Evaluating an untrained GRU. Relative performance of sequential, DEER and quasi-DEER for evaluating a randomly initialized (and untrained) GRU on (Top Row) wall-clock time, averaged over 20 random seeds and (Bottom Row) memory, averaged over 3 random seeds. All experiments use a 16GB V100 SXM2 (memory capacity indicated by the black dashed line) and Newton methods were run to convergence. Missing points in each series indicate the GPU ran out of memory. Quasi-DEER has a runtime commensurate with DEER, but with lower memory consumption, allowing quasi-DEER to work at scales where DEER cannot. The accuracy of the final converged solution is similar for all methods (see the accuracy figure in the appendix).
Figure 3: Training a GRU with DEER. Comparison of DEER and quasi-DEER during GRU training for the C. elegans time-series classification task. Each time series has length $T=17,984$. We show the median and the 5-95% interval across a rolling window of 20 training steps. (Left) DEER and quasi-DEER have similar validation accuracy trajectories, indicating similar training dynamics. The sequential trace shown is for 24 hours of training (compared to 11 and 4 hours for the whole DEER and quasi-DEER traces). (Center) Each quasi-DEER training iteration is 2.5 times faster than each DEER training iteration. Sequential training steps took more than 6 seconds each (not pictured). (Right) Each quasi-DEER training iteration requires (approximately) 2 times more Newton iterations to converge, indicating that each quasi-Newton step is approximately 5 times faster than the corresponding DEER Newton step.
Figure 4: ELK stabilizes parallel evaluation of an AR GRU. (Top Left) The mean absolute difference (MAD) evaluated on the outputs converges rapidly for all four methods on a sequence generated by an untrained AR GRU. (Top Right) The MAD for evaluating a trained AR GRU. Undamped DEER variants are unstable and converge slowly (using the reset heuristic). ELK stabilizes and accelerates convergence. (Bottom) The output after 1, 100, 1000, and 2000 Newton iterations. The black dotted line is the true trace. ELK and quasi-ELK converge rapidly, but DEER and quasi-DEER are unstable. The lines where DEER and quasi-DEER are zero depict the zeroing heuristic.
Figure 5: The accuracy of evaluating with parallelized methods (DEER and quasi-DEER) as opposed to sequential evaluation. The parallelized methods converge to the correct trace within numerical precision. The hidden state size is $D=4$ and the sequence length is $T=10,000$.
...and 4 more figures

Theorems & Definitions (7)

Proposition 1
Proof sketch
Proposition 2
Proof
Proof
Proposition 3
Proof
