Parallelizing non-linear sequential models over the sequence length

Yi Heng Lim; Qi Zhu; Joshua Selfridge; Muhammad Firmansyah Kasim

TL;DR

The paper tackles the bottleneck of training non-linear sequential models such as RNNs and NeuralODEs by introducing the DEER framework, which recasts non-linear differential equations as a fixed-point problem with quadratic convergence using an inverse linear operator $L_\text{G}^{-1}$. By selecting $G_p(\boldsymbol{r}) = -\partial_p\mathbf{f}$, DEER achieves Newton-like, quadratic convergence and allows parallel computation of forward and backward passes without changing model architectures, leveraging parallel prefix scans and FFT-like operations. Empirically, DEER delivers orders-of-magnitude speedups: forward evaluations up to $>10^3\times$, and training improvements up to $>10\times$ on long sequences, while maintaining comparable accuracy to standard sequential solvers. The framework is demonstrated on diverse tasks including long-sequence GRUs (EigenWorms), Hamiltonian Neural Networks for physical systems, and CIFAR-10 sequence classification, underscoring its potential to accelerate exploration and deployment of non-linear sequential models in practice.
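To make the idea concrete, here is a minimal sketch of the Newton-like iteration for a discrete recurrence $y_i = f(y_{i-1}, x_i)$. It is not the authors' implementation. It assumes a toy scalar cell `f` (tanh with a hypothetical weight `w = 0.5`), an analytic derivative `df_dy`, and a plain-Python Hillis-Steele scan named `inclusive_scan`; all of these names are illustrative. Each iteration linearizes the recurrence around the current guess, which yields the affine recurrence $y_i = a_i y_{i-1} + b_i$, and then solves it with an associative prefix scan over the maps $(a_i, b_i)$. In practice the scan would run on parallel hardware, for example with `jax.lax.associative_scan`, and the state would be a vector with a Jacobian matrix in place of the scalar `a_i`.

```python
import math

W = 0.5  # hypothetical recurrent weight for the toy cell


def f(y_prev, x):
    """Toy non-linear scalar cell (illustrative): y_i = tanh(W * y_{i-1} + x_i)."""
    return math.tanh(W * y_prev + x)


def df_dy(y_prev, x):
    """Analytic derivative of f with respect to y_prev."""
    return W * (1.0 - math.tanh(W * y_prev + x) ** 2)


def sequential(xs, y0):
    """Reference evaluation: one step at a time."""
    ys, y = [], y0
    for x in xs:
        y = f(y, x)
        ys.append(y)
    return ys


def combine(first, second):
    """Compose affine maps y -> a*y + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(elems):
    """Hillis-Steele inclusive scan. The updates within a round are independent
    of each other, so a parallel machine needs only O(log T) rounds."""
    elems = list(elems)
    shift = 1
    while shift < len(elems):
        elems = [
            combine(elems[i - shift], elems[i]) if i >= shift else elems[i]
            for i in range(len(elems))
        ]
        shift *= 2
    return elems


def deer_solve(xs, y0, max_iter=100, tol=1e-12):
    """Newton-like fixed-point iteration over the whole sequence at once."""
    ys = [0.0] * len(xs)  # initial guess for y_1..y_T
    for it in range(1, max_iter + 1):
        prev = [y0] + ys[:-1]  # current guess for y_{i-1}
        a = [df_dy(p, x) for p, x in zip(prev, xs)]
        # Linearization: f(y) ~ f(p) + a * (y - p) = a * y + (f(p) - a * p)
        b = [f(p, x) - ai * p for p, x, ai in zip(prev, xs, a)]
        maps = inclusive_scan(zip(a, b))
        new_ys = [A * y0 + B for A, B in maps]
        err = max(abs(n - o) for n, o in zip(new_ys, ys))
        ys = new_ys
        if err < tol:
            return ys, it
    return ys, max_iter


xs = [math.sin(0.3 * i) for i in range(64)]
ys, iters = deer_solve(xs, y0=0.1)
ref = sequential(xs, y0=0.1)
print("iterations:", iters)
print("max abs difference:", max(abs(p - q) for p, q in zip(ys, ref)))
```

The design choice worth noting is that the sequential dependency survives only inside the scan, which is associative and therefore parallelizable; the non-linear function and its derivative are evaluated independently at every sequence position.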

Abstract

Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherent sequential nature. For many years this bottleneck has persisted, as many thought sequential models could not be parallelized. We challenge this long-held belief with our parallel algorithm that accelerates GPU evaluation of sequential models by up to 3 orders of magnitude without compromising output accuracy. The algorithm does not need any special structure in the sequential models' architecture, making it applicable to a wide range of architectures. Using our method, training sequential models can be more than 10 times faster than the common sequential method without any meaningful difference in the training results. Leveraging this accelerated training, we discovered the efficacy of the Gated Recurrent Unit in a long time series classification problem with 17k time samples. By overcoming the training bottleneck, our work serves as the first step to unlock the potential of non-linear sequential models for long sequence problems.

Paper Structure (31 sections, 54 equations, 8 figures, 6 tables)

Introduction
Related works
DEER framework
DEER framework
Derivatives
Practical implementation
Parallelizing ordinary differential equations (ODE)
Parallelizing RNN
Complexity and limitations
Experiments
Performance benchmarking
Learning physical systems with NeuralODE
Time-series classification with recurrent neural network (RNN)
Sequence classification with multiple heads RNN
Conclusion
...and 16 more sections

Figures (8)

Figure 1: Evaluating sequential models using (a) sequential method and (b) iterative method that is parallelizable.
Figure 2: The speedup of GRU evaluation using the DEER method (this paper) vs the commonly used sequential method on a V100 GPU for (top) forward and (bottom) forward + gradient calculations. Data are missing for large numbers of dimensions and long sequence lengths because the DEER method ran out of memory. The bar height represents the mean speedup over 5 different random seeds.
Figure 3: (a) The comparison between the outputs of GRU evaluated with the sequential method vs the DEER method. The line for the sequential method output is barely visible because it is overlaid by the output of the DEER method. Only the last 200 indices are shown for clarity. (b) The difference between the outputs of the sequential and DEER methods over the whole 10k sample length.
Figure 4: (Top) The validation losses of HNN with NeuralODE training using DEER method (shown in blue) vs RK45 method (in orange) as a function of (a) training hours and (b) training steps. (Bottom) The validation accuracy of RNN training using DEER method (blue) vs the sequential method (orange) as a function of (c) training hours and (d) training steps.
Figure 5: Architecture for the EigenWorms experiments.
...and 3 more figures
