Parallel-in-iteration optimization using multigrid reduction-in-time

G. H. M. Araújo, O. A. Krzysik, H. De Sterck

Abstract

Standard gradient-based iteration algorithms for optimization, such as gradient descent and its various proximal-based extensions to nonsmooth problems, are known to converge slowly for ill-conditioned problems, sometimes requiring many tens of thousands of iterations in practice. Since these iterations are computed sequentially, they may present a computational bottleneck in large-scale parallel simulations. In this work, we present a "parallel-in-iteration" framework that allows one to parallelize across these iterations using multiple processors with the objective of reducing the wall-clock time needed to solve the underlying optimization problem. Our methodology is based on re-purposing parallel time integration algorithms for time-dependent differential equations, motivated by the fact that optimization algorithms often have interpretations as discretizations of time-dependent differential equations (such as gradient flow). Specifically in this work, we use the parallel-in-time method of multigrid reduction-in-time (MGRIT), but note that our approach permits in principle the use of any other parallel-in-time method. We numerically demonstrate the efficacy of our approach on two different model problems, including a standard convex quadratic problem and the nonsmooth elastic obstacle problem in one and two spatial dimensions. For our model problems, we observe fast MGRIT convergence analogous to its prototypical performance on partial differential equations of diffusion type. Some theory is presented to connect the convergence of MGRIT to the convergence of the underlying optimization algorithm. Theoretically predicted parallel speedup results are also provided.
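The interpretation of an optimization iteration as a time discretization can be made concrete with a small sketch. Assuming a convex quadratic $f(\mathbf{u}) = \tfrac12 \mathbf{u}^\top A \mathbf{u} - \mathbf{b}^\top \mathbf{u}$ (the matrix `A`, vector `b`, and step size `s` below are illustrative choices, not values from the paper), one gradient descent step with step size $s$ coincides with one forward Euler step of size $\Delta t = s$ applied to the gradient flow $\mathbf{u}'(t) = -\nabla f(\mathbf{u}(t))$:

```python
import numpy as np

def grad_f(u, A, b):
    # Gradient of f(u) = 0.5 * u^T A u - b^T u for symmetric A.
    return A @ u - b

def gradient_descent(u0, A, b, s, n_steps):
    # Plain gradient descent with a fixed step size s.
    u = u0.copy()
    for _ in range(n_steps):
        u = u - s * grad_f(u, A, b)
    return u

def forward_euler_gradient_flow(u0, A, b, dt, n_steps):
    # Forward Euler applied to the gradient flow u'(t) = -grad f(u(t)).
    u = u0.copy()
    for _ in range(n_steps):
        dudt = -grad_f(u, A, b)
        u = u + dt * dudt
    return u

# Illustrative ill-conditioned quadratic (condition number 100); hypothetical data.
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
u0 = np.zeros(3)
s = 1.0 / 100.0  # step size 1/L, with L the largest eigenvalue of A

u_gd = gradient_descent(u0, A, b, s, 2000)
u_fe = forward_euler_gradient_flow(u0, A, b, s, 2000)
u_star = np.linalg.solve(A, b)  # exact minimizer, for comparison
```

Because the two loops perform the same update, any parallel-in-time method that targets the forward Euler recursion can in principle be applied to the sequence of optimization iterates.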

Paper Structure

This paper contains 26 sections, 1 theorem, 82 equations, 11 figures, 23 tables.

Key Result

Lemma 5.1

Let $f:\mathbb{R}^N\rightarrow\mathbb{R}$ be a convex and $L$-smooth function, and let $g:\mathbb{R}^N \rightarrow \mathbb{R}$ be a continuous convex function. Consider a parallel-in-iteration method described by an all-at-once system of the form \ref{eq:A(u)=g} with $\Phi:=I-sG_{sF}$ the proximal gradient operator
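The operator $\Phi = I - sG_{sF}$ in the lemma can be illustrated with a minimal sketch. It assumes the standard definition of the generalized gradient, $G_{sF}(\mathbf{u}) = \tfrac{1}{s}\big(\mathbf{u} - \operatorname{prox}_{sg}(\mathbf{u} - s\nabla f(\mathbf{u}))\big)$, which is not shown in this excerpt, and it uses $g = \lambda\|\cdot\|_1$ purely as an example of a continuous convex function with a closed-form proximal map. The quadratic `A`, `b`, the weight `lam`, and all function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal map of tau * ||.||_1 (componentwise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def generalized_gradient(u, grad_f, prox_g, s):
    # Assumed definition: G(u) = (u - prox_{s g}(u - s * grad f(u))) / s.
    return (u - prox_g(u - s * grad_f(u), s)) / s

def prox_grad_step(u, grad_f, prox_g, s):
    # Phi(u) = u - s * G(u), i.e. one proximal gradient step.
    return u - s * generalized_gradient(u, grad_f, prox_g, s)

# Illustrative problem: f(u) = 0.5 u^T A u - b^T u, g(u) = lam * ||u||_1.
A = np.diag([1.0, 4.0])
b = np.array([1.0, -2.0])
lam = 0.5
grad_f = lambda u: A @ u - b
prox_g = lambda v, s: soft_threshold(v, s * lam)
s = 1.0 / 4.0  # step size 1/L, with L = 4 the largest eigenvalue of A

u = np.zeros(2)
for _ in range(200):
    u = prox_grad_step(u, grad_f, prox_g, s)

# The generalized gradient vanishes at a minimizer of f + g.
gg_norm = np.linalg.norm(generalized_gradient(u, grad_f, prox_g, s))
```

Under this definition, $\Phi(\mathbf{u})$ reduces to $\operatorname{prox}_{sg}(\mathbf{u} - s\nabla f(\mathbf{u}))$, and a fixed point of $\Phi$ is exactly a point where the generalized gradient is zero.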

Figures (11)

  • Figure 2.1: (a) and (b): obstacle $\phi$ and solution $\widehat{u}$ of MP2-1D \ref{eq:mp2}. (c) and (d): obstacle $\phi$ and solution $\widehat{u}$ of MP2-2D \ref{eq:mp2}, computed numerically using the proximal gradient algorithm \ref{eq:pg}.
  • Figure 4.1: Fine- and coarse-grid discretization meshes with coarsening factor $m$. F-points (in black) are points $t_i$ for $i=1,\ldots,N_t$ such that $i \neq jm$ for $j=1,\ldots,N_T=\frac{N_t}{m}$, and C-points (in blue) are points $T_j=t_{jm}$ for $j=1,\ldots,N_T=\frac{N_t}{m}$.
  • Figure 6.1: (a): Convergence of MP1, \ref{eq:mp1}, with a standard sequential solve using the gradient descent method \ref{eq:gd}; $\nabla f(\mathbf{u}_q)$ denotes the gradient of the $q$th sequential iterate. (b): Convergence of MP1, \ref{eq:mp1}, with MGRIT using the gradient descent operator \ref{eq:gd} on the fine grid and the proximal point operator \ref{eq:ppm} on the coarse grid; $\mathbf{r}^{k}$ denotes the residual \ref{eq:mgrit_res} at iteration $k$, $\nabla f(\mathbf{u}_{N_t}^k)$ denotes the gradient of the approximate solution at the final time point $N_t$ and MGRIT iteration $k$, $\nabla f(\mathbf{u}_0)$ denotes the gradient of the initial guess $\mathbf{u}_0$, and $\mathbf{r}^0$ denotes the initial MGRIT residual.
  • Figure 6.2: (a): 2-norm of $\nabla f\left(\mathbf{u}^{k}_{t_C}\right)$, where $\nabla f$ is the gradient of $f$, and $\mathbf{u}^{k}_{t_C}$ is the approximate solution at coarse time point $t_C$ and iteration $k$. (b): 2-norm of $\mathbf{r}^{k}_{t_C}$, where $\mathbf{r}^{k}_{t_C}$ is the residual \ref{eq:mgrit_res} at coarse time point $t_C$ and iteration $k$.
  • Figure 6.3: (a): Convergence of MP2-1D, \ref{eq:mp2}, with a standard sequential solve using the proximal gradient descent method \ref{eq:pg}; $G_{sF}(\mathbf{u}_q)$ denotes the generalized gradient \ref{eq:G_sF} of the $q$th sequential iterate. (b): Convergence of MP2-1D, \ref{eq:mp2}, with MGRIT using the proximal gradient operator \ref{eq:pg} on the fine grid and the alternating proximal mappings operator \ref{eq:apm} on the coarse grid; $\mathbf{r}^{k}$ denotes the residual \ref{eq:mgrit_res} at iteration $k$, $G_{sF}(\mathbf{u}_{N_t}^k)$ denotes the generalized gradient \ref{eq:G_sF} of the approximate solution at the final time point $N_t$ and MGRIT iteration $k$, $G_{sF}(\mathbf{u}_0)$ denotes the generalized gradient of the initial guess $\mathbf{u}_0$, and $\mathbf{r}^0$ denotes the initial MGRIT residual.
  • ...and 6 more figures
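The C/F splitting of Figure 4.1 and the fine/coarse operator pairing of Figure 6.1 can be sketched together in a few lines. This is not the authors' implementation: it uses the Parareal iteration, which is known to be equivalent to two-level MGRIT with F-relaxation, and it assumes a convex quadratic objective, a gradient descent fine step with step size `s`, and a proximal point coarse step with step size `m * s` (the coarse step size and all names and data below are illustrative assumptions).

```python
import numpy as np

def cf_split(N_t, m):
    # C-points are indices i = j*m for j = 1..N_t/m; all other i in 1..N_t are F-points.
    assert N_t % m == 0
    c_points = [j * m for j in range(1, N_t // m + 1)]
    f_points = [i for i in range(1, N_t + 1) if i % m != 0]
    return c_points, f_points

def two_level_parareal(u0, fine_step, coarse_step, N_t, m, n_iters):
    # Returns the iterates at t_0 and at the N_T = N_t/m C-points.
    N_T = N_t // m

    def fine_block(u):
        # m fine steps: propagates from one C-point to the next.
        for _ in range(m):
            u = fine_step(u)
        return u

    # Initial guess at the C-points from the coarse operator alone.
    U = [u0.copy()]
    for j in range(N_T):
        U.append(coarse_step(U[j]))

    for _ in range(n_iters):
        F = [fine_block(U[j]) for j in range(N_T)]   # independent, hence parallelizable
        G_old = [coarse_step(U[j]) for j in range(N_T)]
        U_new = [u0.copy()]
        for j in range(N_T):                         # cheap sequential coarse sweep
            U_new.append(coarse_step(U_new[j]) + F[j] - G_old[j])
        U = U_new
    return U

# Illustrative quadratic f(u) = 0.5 u^T A u - b^T u; hypothetical data.
A = np.diag([1.0, 10.0, 100.0])
b = np.ones(3)
s, m, N_t = 0.01, 4, 32
I = np.eye(3)
fine_step = lambda u: u - s * (A @ u - b)                              # gradient descent
coarse_step = lambda u: np.linalg.solve(I + m * s * A, u + m * s * b)  # proximal point

u0 = np.zeros(3)
U = two_level_parareal(u0, fine_step, coarse_step, N_t, m, n_iters=N_t // m)

# Sequential reference: N_t gradient descent steps.
u_seq = u0.copy()
for _ in range(N_t):
    u_seq = fine_step(u_seq)

c_pts, f_pts = cf_split(12, 4)
```

After as many iterations as there are C-points, the iteration reproduces the sequential iterates at the C-points. A practical speedup requires convergence in far fewer iterations than that.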

Theorems & Definitions (5)

  • Lemma 5.1
  • proof
  • remark 1
  • remark 2
  • remark 3: Nesterov's accelerated gradient ODE and hyperbolic PDE connection