A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems

Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Leo Kozachkov, Christopher Ré, Scott W. Linderman

Abstract

Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using iterative fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. Moreover, we theoretically analyze the rates of convergence of these methods, and we verify the predictions of this theory with several case studies. This unifying framework highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, the framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
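
The idea in the abstract can be sketched on a toy problem. The code below is a minimal illustration, not the authors' implementation: it assumes a hypothetical scalar recursion `f(x, u) = tanh(w * x + u)` and treats each fixed-point method as one sweep of a linear recursion whose transition coefficient `a` is the Jacobian (Newton), the identity (Picard, as in Proposition 1), or zero (Jacobi). The Newton and Jacobi choices are assumptions that this excerpt does not state explicitly. The sweep is written as a plain loop for readability; in practice the linear recursion is what a parallel scan would evaluate.

```python
import math

def f(x, u, w=0.9):
    """Hypothetical nonlinear recursion x_t = f(x_{t-1}, u_t)."""
    return math.tanh(w * x + u)

def df_dx(x, u, w=0.9):
    """Derivative of f with respect to the state x."""
    return w * (1.0 - math.tanh(w * x + u) ** 2)

def sequential(x0, inputs):
    """Reference solution obtained by stepping through time."""
    xs, x = [], x0
    for u in inputs:
        x = f(x, u)
        xs.append(x)
    return xs

def lds_sweep(x0, inputs, guess, method):
    """One fixed-point iteration, written as a linear recursion in the new iterate."""
    prev = [x0] + guess[:-1]  # previous-iterate states x_{t-1}
    new, x_new_prev = [], x0
    for x_old_prev, u in zip(prev, inputs):
        if method == "newton":
            a = df_dx(x_old_prev, u)  # exact linearization
        elif method == "picard":
            a = 1.0                   # identity transition
        elif method == "jacobi":
            a = 0.0                   # zero transition (assumed here)
        else:
            raise ValueError(f"unknown method: {method}")
        x_new = f(x_old_prev, u) + a * (x_new_prev - x_old_prev)
        new.append(x_new)
        x_new_prev = x_new
    return new

def solve(x0, inputs, method, tol=1e-10, max_iters=1000):
    """Iterate sweeps until the trajectory stops changing."""
    guess = [0.0] * len(inputs)
    for i in range(1, max_iters + 1):
        new = lds_sweep(x0, inputs, guess, method)
        if max(abs(a - b) for a, b in zip(new, guess)) < tol:
            return new, i
        guess = new
    return guess, max_iters
```

Calling `solve(0.5, inputs, "newton")` returns the trajectory together with the number of sweeps used, so the three methods can be compared on the same inputs. Each sweep fixes at least one more time step exactly, which is why none of them needs more than roughly `len(inputs)` sweeps.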

Paper Structure

This paper contains 52 sections, 7 theorems, 58 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

The Picard iteration operator $\mathcal{A}_P$ given by eq:picard_shih is a special case of an LDS, eq:common_form, where the transition matrix is the identity.
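
To see why an identity transition matrix yields Picard iterations, it may help to write the step out. The equations referenced as eq:picard_shih and eq:common_form are not reproduced on this page, so the common form below is an assumed reconstruction; the symbols $f_t$, $A_t$, and the iteration index $(i)$ are illustrative notation.

```latex
% Assumed common LDS form of one fixed-point iteration (i = iteration index)
x_t^{(i+1)} = f_t\big(x_{t-1}^{(i)}\big) + A_t \big(x_{t-1}^{(i+1)} - x_{t-1}^{(i)}\big)

% Picard: set A_t = I and rearrange
x_t^{(i+1)} - x_{t-1}^{(i+1)} = f_t\big(x_{t-1}^{(i)}\big) - x_{t-1}^{(i)}

% Sum over s = 1, ..., t; the left side telescopes and x_0 is fixed
x_t^{(i+1)} = x_0 + \sum_{s=1}^{t} \Big[ f_s\big(x_{s-1}^{(i)}\big) - x_{s-1}^{(i)} \Big]
```

Under this assumption, each Picard iteration is a cumulative sum of increments evaluated at the previous iterate, which can be computed with a prefix sum rather than a general matrix-valued scan.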

Figures (10)

  • Figure 1: Parallel scan for LDS. We illustrate the parallel scan for a sequence of length $T=4$ for the simple LDS $x_t = A_t x_{t-1}$.
  • Figure 2: A single Newton iteration solves the $S_5$ group word problem, whereas the number of iterations required for the other methods increases with sequence length. We consider the task of evaluating the product of $S_5$ group elements. A: The group word problem can be expressed as an LDS with input-dependent state-transition matrices. B: An example input-dependent transition matrix $A_t$ for permutation $(1\ 5\ 2\ 4\ 3)$, in cycle notation. C: For each fixed-point method and a range of sequence lengths, $T$, we compute the median (over ten random seeds) number of fixed-point iterations to converge (top) and the median wall-clock time (bottom). While a single Newton iteration is sufficient to solve the $S_5$ problem, the number of iterations required for the other methods increases with the sequence length.
  • Figure 3: Picard iterations struggle to parallelize RNNs. We evaluate GRUs with random parameter initialization for different sequence lengths $T$ and hidden state sizes $D$. A: The nonlinear dynamics of a GRU, following Feng2024, where $x_t$ is the hidden state, $u_t$ is the input, and the notation $\mathrm{Linear}[\cdot, \cdot]$ indicates a linear readout from the concatenation of two vectors. B: A representative Jacobian matrix $\partial f_{t}/\partial x$ from a GRU trajectory, which is not well approximated by the identity matrix. C: For each fixed-point method and a range of sequence lengths, $T$, and state sizes, $D$, we compute the median (over ten random seeds) number of fixed-point iterations to converge (top row) and the median wall-clock time (bottom row). Picard iterations take nearly $T$ iterations to converge, while the other fixed-point methods yield order-of-magnitude speed-ups over sequential evaluation.
  • Figure 4: Jacobi iterations struggle when the dynamics Jacobian is close to the identity. We evaluate Langevin dynamics for a potential $\phi$. A: The nonlinear Langevin dynamics for a potential $\phi$ and step size $\epsilon$, where $x_t$ is the state and $w_t$ is Gaussian noise. B: The Jacobian for Langevin dynamics is well approximated by the identity matrix, especially for a small step size such as $\epsilon = 10^{-5}$. C: We evaluate Langevin dynamics for larger dimensions, plotting the median over ten random seeds. Jacobi iterations consistently take $T$ steps and are always slower than sequential evaluation, while the other fixed-point methods converge in fewer than $T$ steps and can be faster than sequential evaluation. The missing Newton iteration points indicate that the GPU ran out of memory.
  • Figure 5: Parallel Scan for Matrix Multiplication. We illustrate a divide-and-conquer approach to compute the product $A_4 A_3 A_2 A_1$. Note that this divide-and-conquer approach naturally leads to $\mathcal{O}(\log T)$ depth.
  • ...and 5 more figures
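
The parallel scan described in Figures 1 and 5 can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: matrices are plain nested lists, and the scan uses a doubling (Hillis-Steele style) schedule rather than the exact divide-and-conquer tree drawn in the figures. It returns every prefix product $A_t \cdots A_1$ after about $\log_2 T$ rounds, where all updates within a round are independent of each other.

```python
def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def prefix_products(mats):
    """Inclusive scan: entry t holds A_t ... A_1, with later matrices on the left."""
    out = list(mats)
    shift, rounds = 1, 0
    while shift < len(out):
        # Each update reads only values from the previous round,
        # so the whole round could run in parallel.
        out = [out[t] if t < shift else matmul(out[t], out[t - shift])
               for t in range(len(out))]
        shift *= 2
        rounds += 1
    return out, rounds

# Example with T = 4 small integer matrices (values chosen arbitrarily).
mats = [[[1, 1], [0, 1]], [[2, 0], [1, 1]], [[0, 1], [1, 0]], [[1, 2], [3, 4]]]
prefixes, rounds = prefix_products(mats)
```

With an initial state $x_0$, the trajectory of $x_t = A_t x_{t-1}$ is then `prefixes[t-1]` applied to $x_0$. The doubling schedule performs more total multiplications than a work-efficient tree, but it keeps the sketch short while preserving the logarithmic depth.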

Theorems & Definitions (18)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3: cf. Proposition 4 of lu2025parasolver
  • proof
  • Lemma 1
  • proof
  • Definition 1: Group Word Problem
  • Proposition 4
  • ...and 8 more
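
The group word problem (Definition 1, Figure 2) can be made concrete with permutation matrices. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the column convention in which the matrix of a permutation $\sigma$ sends basis vector $e_i$ to $e_{\sigma(i)}$, which may differ from the convention used in the paper's figure. It checks that multiplying the matrices $A_T \cdots A_1$ gives the same result as composing the permutations directly.

```python
def cycle_to_perm(cycle, n=5):
    """Turn a 1-indexed cycle into a 0-indexed list with perm[i] = sigma(i)."""
    perm = list(range(n))
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        perm[a - 1] = b - 1
    return perm

def perm_matrix(perm):
    """Matrix A with A[j][i] = 1 when sigma(i) = j (column convention, assumed)."""
    n = len(perm)
    return [[1 if perm[i] == j else 0 for i in range(n)] for j in range(n)]

def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def evaluate_word_lds(perms):
    """Run X_t = A_t X_{t-1} from X_0 = I; the final state is the word's product."""
    n = len(perms[0])
    state = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for perm in perms:
        state = matmul(perm_matrix(perm), state)
    return state

def compose(perms):
    """Compose permutations directly: apply perms[0] first, then perms[1], and so on."""
    n = len(perms[0])
    result = list(range(n))
    for perm in perms:
        result = [perm[i] for i in result]
    return result

# A short word in S_5, starting with the cycle (1 5 2 4 3) from Figure 2.
word = [cycle_to_perm([1, 5, 2, 4, 3]), cycle_to_perm([1, 2]), cycle_to_perm([3, 4, 5])]
```

Because every step is already linear in the state, the recursion needs no linearization, which is consistent with the caption's statement that a single Newton iteration suffices for this task.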