Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks

Xavier Gonzalez

Abstract

Massively parallel hardware (GPUs) and long-sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, such as recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton's method using a parallel associative scan. However, these parallel Newton methods suffered from three main limitations: inefficiency, instability, and a lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing particularly from optimization. Methodologically, we develop scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory-efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods into our parallel Newton framework, including Picard and Jacobi iterations. We establish a linear convergence rate for these techniques that depends on the method's approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot. Specifically, the sign of the largest Lyapunov exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.
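To make the reformulation described in the abstract concrete, here is a minimal sketch for a scalar recurrence $s_t = f(s_{t-1})$. It is an illustration under stated assumptions, not the thesis's implementation: the dynamics `f`, its derivative `df`, and every function name are hypothetical, and the scan is simulated in plain Python (on parallel hardware, the updates within each of its O(log T) rounds could run concurrently). Each Newton iteration linearizes every transition around the current guess, which turns the update into a linear recurrence that a scan over affine maps can evaluate.

```python
import math


def f(s):
    """Hypothetical scalar transition function (an assumption, not from the thesis)."""
    return math.tanh(0.9 * s + 0.5)


def df(s):
    """Derivative of f with respect to s."""
    return 0.9 * (1.0 - math.tanh(0.9 * s + 0.5) ** 2)


def compose(later, earlier):
    """Compose affine maps s -> a*s + b, applying `earlier` first."""
    a2, b2 = later
    a1, b1 = earlier
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(elems, op):
    """Hillis-Steele scan: O(log T) rounds; updates within a round are independent."""
    out = list(elems)
    shift = 1
    while shift < len(out):
        out = [
            out[i] if i < shift else op(out[i], out[i - shift])
            for i in range(len(out))
        ]
        shift *= 2
    return out


def sequential_rollout(s0, T):
    """Reference: evaluate s_1..s_T one step at a time."""
    states, s = [], s0
    for _ in range(T):
        s = f(s)
        states.append(s)
    return states


def parallel_newton(s0, T, num_iters):
    """Newton's method on the residuals r_t = s_t - f(s_{t-1}), t = 1..T."""
    guess = [0.0] * T  # initial guess for s_1..s_T
    for _ in range(num_iters):
        prev = [s0] + guess[:-1]  # current guess for s_{t-1}
        # Linearize each step around the guess: s_t = a_t * s_{t-1} + b_t
        a = [df(p) for p in prev]
        b = [f(p) - df(p) * p for p in prev]
        # Prefix compositions map s_0 directly to each s_t
        prefix = inclusive_scan(list(zip(a, b)), compose)
        guess = [a_t * s0 + b_t for a_t, b_t in prefix]
    return guess


if __name__ == "__main__":
    T = 16
    exact = sequential_rollout(0.0, T)
    approx = parallel_newton(0.0, T, num_iters=T)
    # Largest deviation from the sequential rollout
    print(max(abs(x - y) for x, y in zip(exact, approx)))
```

In exact arithmetic, the first k states agree with the sequential rollout after k iterations, so T iterations always suffice. The abstract's claim concerns when far fewer iterations are enough, which it ties to the stability of the dynamics.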

Paper Structure

This paper contains 139 sections, 18 theorems, 192 equations, 29 figures, 6 tables, 2 algorithms.

Introduction and Background
Introduction
Extended History
Outline
Background
Dynamics: State Space Models
State Space Models (SSMs)
Problem Statement (Unrolling an SSM):
Examples of SSMs
Bayesian inference for linear Gaussian SSMs: Kalman filtering and smoothing
Limitation of SSMs: "Inherently Sequential"
Parallel Computing: The Parallel Associative Scan
The Parallel Scan: A Gentle Introduction
Simple example: multiplying a sequence of matrices
Detail #1: Parallel scans for arbitrary binary associative operators
...and 124 more sections

Key Result

Proposition 1

Say we are trying to find a root of $\mathbf{r}: \mathbb{R}^P \to \mathbb{R}^P$ with Newton's method as defined in eq:newton_root. If we assume that $\mathbf{J}(\mathbf{s})$ is $L$-Lipschitz and is always invertible with $\| \mathbf{J}(\mathbf{s})^{-1} \| \leq \beta$ for all …

Figures (29)

Figure 1: Unrolling an SSM. We shade the initial state $s_0$ to indicate that we know the initial condition.
Figure 2: Graphical diagram showing the equivalence (based on currying) between an SSM driven by inputs and an autonomous system with time-varying transition dynamics. We shade the inputs $u_t$ to indicate that they are known.
Figure 3: A linear Gaussian state space model (LGSSM): The LGSSM consists of latent variables $s_t$ and observed variables $o_t$. The generative model of the LGSSM consists of dynamics $s_{t+1} \sim \mathcal{N}\left( A s_t, Q \right)$ and emissions $o_{t+1} \sim \mathcal{N}\left( C s_{t+1}, R \right)$.
Figure 4: Parallel Scan for Matrix Multiplication. We illustrate a divide-and-conquer approach to compute the product $A_4 A_3 A_2 A_1$. Note that this divide-and-conquer approach naturally leads to $\mathcal{O}(\log T)$ depth.
Figure 5: Newton's method for root-finding. Here we illustrate 3 iterations of Newton's method for root-finding on the one-dimensional cubic function $\mathbf{r}(\mathbf{s}) = (\mathbf{s} - 0.4)^3 + 0.45 (\mathbf{s} - 0.4)$. We observe that each iteration of Newton's method involves linearizing the function to obtain $\mathbf{\hat{r}}^{(i)}(\cdot)$ (shown in color) and then finding the zero of this linearization to obtain our next guess.
...and 24 more figures
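The iteration described in the Figure 5 caption can be sketched as follows. This is a minimal illustration rather than the code behind the figure: the function names and the starting point 1.0 are assumptions, since the caption does not state the initial guess. Because the derivative $3(\mathbf{s} - 0.4)^2 + 0.45$ is strictly positive, the cubic has a single real root, at $\mathbf{s} = 0.4$.

```python
def r(s):
    """Cubic from the Figure 5 caption."""
    return (s - 0.4) ** 3 + 0.45 * (s - 0.4)


def dr(s):
    """Derivative of r."""
    return 3.0 * (s - 0.4) ** 2 + 0.45


def newton(s, num_iters):
    """Return all iterates, starting with the initial guess."""
    iterates = [s]
    for _ in range(num_iters):
        # Next guess is the zero of the linearization r(s_i) + dr(s_i) * (s - s_i)
        s = s - r(s) / dr(s)
        iterates.append(s)
    return iterates


if __name__ == "__main__":
    # Starting point 1.0 is an assumption; the caption does not give one.
    for i, s in enumerate(newton(1.0, 3)):
        print(i, s, r(s))
```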

Theorems & Definitions (42)

Definition 1: Closure
Definition 2: Q-convergence
Proposition 1
proof
Example 1: Newton's method can diverge: $\mathbf{r}(\mathbf{s}) = \mathbf{s}^{1/3}$.
Proposition 2
proof
Proposition 3
proof
Definition 3: Predictability and Unpredictability
...and 32 more
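The divergence named in Example 1 can be checked numerically. For $\mathbf{r}(\mathbf{s}) = \mathbf{s}^{1/3}$ the Newton update simplifies to $\mathbf{s} - \mathbf{s}^{1/3} / \left(\tfrac{1}{3}\mathbf{s}^{-2/3}\right) = -2\mathbf{s}$, so every iterate is twice as far from the root at $0$ as the one before it. The sketch below is an illustration only; the function names and the starting point 0.1 are assumptions.

```python
import math


def cbrt(s):
    """Real cube root, defined for negative s as well."""
    return math.copysign(abs(s) ** (1.0 / 3.0), s)


def dcbrt(s):
    """Derivative of the cube root: (1/3) * |s|^(-2/3), valid for s != 0."""
    return abs(s) ** (-2.0 / 3.0) / 3.0


def newton_cbrt(s, num_iters):
    """Run Newton's method on the cube root and return all iterates."""
    iterates = [s]
    for _ in range(num_iters):
        s = s - cbrt(s) / dcbrt(s)  # algebraically equal to -2 * s
        iterates.append(s)
    return iterates


if __name__ == "__main__":
    # Starting point 0.1 is an assumption chosen for illustration.
    for i, s in enumerate(newton_cbrt(0.1, 5)):
        print(i, s)
```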
