Parallelizing non-linear sequential models over the sequence length
Yi Heng Lim, Qi Zhu, Joshua Selfridge, Muhammad Firmansyah Kasim
TL;DR
The paper tackles the training bottleneck of non-linear sequential models such as RNNs and NeuralODEs by introducing the DEER framework, which recasts the evaluation of a non-linear differential equation as a fixed-point problem solved with an inverse linear operator $L_\text{G}^{-1}$. With the choice $G_p(\boldsymbol{r}) = -\partial_p\mathbf{f}$, the fixed-point iteration becomes Newton-like and converges quadratically, and both the forward and backward passes can be computed in parallel over the sequence length without changing the model architecture, leveraging parallel prefix scans and FFT-like operations. Empirically, DEER delivers speedups of orders of magnitude: forward evaluation can exceed $10^3\times$ and training on long sequences can exceed $10\times$, while maintaining accuracy comparable to standard sequential solvers. The framework is demonstrated on diverse tasks, including long-sequence GRUs (EigenWorms), Hamiltonian Neural Networks for physical systems, and CIFAR-10 sequence classification, underscoring its potential to accelerate the exploration and deployment of non-linear sequential models in practice.
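To make the fixed-point idea concrete, the following is a minimal sketch for a discrete recurrence $y_t = f(y_{t-1}, x_t)$; it is not the authors' implementation. It assumes a toy scalar cell $f(y, x) = \tanh(w y + x)$, and the names `affine_scan`, `deer_rnn`, and `sequential_rnn` are hypothetical. Each iteration linearizes $f$ around the current guess (the discrete analogue of choosing $G = -\partial f$) and solves the resulting linear recurrence with an inclusive prefix scan over affine maps. The scan is written in plain Python for readability; its inner loop has no dependence across $t$, which is the part that would run in parallel on a GPU.

```python
import math


def f(y_prev, x, w):
    """Toy scalar recurrent cell (an assumption, not a model from the paper)."""
    return math.tanh(w * y_prev + x)


def df_dy(y_prev, x, w):
    """Derivative of f with respect to y_prev."""
    return w * (1.0 - math.tanh(w * y_prev + x) ** 2)


def sequential_rnn(xs, y0, w):
    """Reference: evaluate the recurrence one step at a time."""
    ys, y = [], y0
    for x in xs:
        y = f(y, x, w)
        ys.append(y)
    return ys


def affine_scan(a, b):
    """Inclusive scan of the affine maps y -> a[t] * y + b[t] under composition.

    Uses log2(n) doubling rounds (Hillis-Steele style). Within a round the
    updates for different t are independent of each other.
    """
    a, b = list(a), list(b)
    n, shift = len(a), 1
    while shift < n:
        new_a, new_b = a[:], b[:]
        for t in range(shift, n):
            # Apply the earlier block (ending at t - shift) first, then the
            # block ending at t.
            new_a[t] = a[t] * a[t - shift]
            new_b[t] = a[t] * b[t - shift] + b[t]
        a, b = new_a, new_b
        shift *= 2
    return a, b


def deer_rnn(xs, y0, w, max_iter=100, tol=1e-12):
    """Newton-like fixed-point iteration over the whole sequence at once."""
    ys = [0.0] * len(xs)  # initial guess for every time step
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        prev = [y0] + ys[:-1]  # y_{t-1} taken from the current guess
        jac = [df_dy(p, x, w) for p, x in zip(prev, xs)]
        rhs = [f(p, x, w) - j * p for p, x, j in zip(prev, xs, jac)]
        # Solve the linear recurrence y_t = jac_t * y_{t-1} + rhs_t via the scan.
        big_a, big_b = affine_scan(jac, rhs)
        new_ys = [a_t * y0 + b_t for a_t, b_t in zip(big_a, big_b)]
        err = max(abs(u - v) for u, v in zip(new_ys, ys))
        ys = new_ys
        if err < tol:
            break
    return ys, n_iter


xs = [math.sin(0.3 * t) for t in range(64)]
ys_seq = sequential_rnn(xs, 0.0, 0.9)
ys_par, n_iter = deer_rnn(xs, 0.0, 0.9)
max_diff = max(abs(p - s) for p, s in zip(ys_par, ys_seq))
print(f"iterations: {n_iter}, max |parallel - sequential|: {max_diff:.2e}")
```

In a GPU setting the doubling loop would typically be replaced by a library primitive such as `jax.lax.associative_scan`, and for vector-valued states the scalars `jac` become Jacobian matrices, so the composition uses matrix products instead of scalar multiplication.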
Abstract
Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherently sequential nature. This bottleneck has persisted for many years, as many thought sequential models could not be parallelized. We challenge this long-held belief with a parallel algorithm that accelerates GPU evaluation of sequential models by up to 3 orders of magnitude without compromising output accuracy. The algorithm does not need any special structure in the sequential model's architecture, making it applicable to a wide range of architectures. Using our method, training sequential models can be more than 10 times faster than the common sequential method without any meaningful difference in the training results. Leveraging this accelerated training, we discovered the efficacy of the Gated Recurrent Unit in a long time-series classification problem with 17k time samples. By overcoming the training bottleneck, our work serves as the first step to unlock the potential of non-linear sequential models for long sequence problems.
