A stochastic subspace approach to gradient-free optimization in high dimensions

David Kozak, Stephen Becker, Alireza Doostan, Luis Tenorio

Abstract

We present a stochastic descent algorithm for unconstrained optimization that is particularly efficient when the objective function is slow to evaluate and gradients are not easily obtained, as in some PDE-constrained optimization and machine learning problems. The algorithm maps the gradient onto a low-dimensional random subspace of dimension $\ell$ at each iteration, similar to coordinate descent but without restricting directional derivatives to be along the axes. Without requiring a full gradient, this mapping can be performed by computing $\ell$ directional derivatives (e.g., via forward-mode automatic differentiation). We give proofs for convergence in expectation under various convexity assumptions as well as probabilistic convergence results under strong-convexity. Our method extends the well-known Gaussian smoothing technique to descent in subspaces of dimension greater than one, opening the doors to new analysis of Gaussian smoothing when more than one directional derivative is used at each iteration. We also provide a finite-dimensional variant of a special case of the Johnson-Lindenstrauss lemma. Experimentally, we show that our method compares favorably to coordinate descent, Gaussian smoothing, gradient descent and BFGS (when gradients are calculated via forward-mode automatic differentiation) on problems from the machine learning and shape optimization literature.
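The descent step described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions not taken from the text: the function names (`haar_columns`, `directional_derivative`, `subspace_step`), the `d/ell` scaling that makes the projected gradient unbiased, the use of central finite differences as a stand-in for forward-mode automatic differentiation, and the toy quadratic objective. It is meant only to show that each step needs $\ell$ directional derivatives rather than a full gradient.

```python
import math
import random


def haar_columns(d, ell, rng):
    """Return ell orthonormal vectors in R^d (Gram-Schmidt on Gaussian draws)."""
    if not 1 <= ell <= d:
        raise ValueError("need 1 <= ell <= d")
    basis = []
    while len(basis) < ell:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in basis:  # remove components along earlier directions
            dot = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - dot * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-12:  # skip the (practically impossible) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


def directional_derivative(f, x, u, h=1e-6):
    """Central-difference stand-in for a forward-mode AD directional derivative."""
    xp = [xi + h * ui for xi, ui in zip(x, u)]
    xm = [xi - h * ui for xi, ui in zip(x, u)]
    return (f(xp) - f(xm)) / (2.0 * h)


def subspace_step(f, x, ell, alpha, rng):
    """One step along the gradient projected onto a random ell-dim subspace."""
    d = len(x)
    scale = d / ell  # assumed scaling so the projected gradient is unbiased
    step = [0.0] * d
    for q in haar_columns(d, ell, rng):
        g_q = directional_derivative(f, x, q)  # one derivative per direction
        step = [s + scale * g_q * qi for s, qi in zip(step, q)]
    return [xi - alpha * si for xi, si in zip(x, step)]


def quadratic(x):
    """Toy objective: gradient Lipschitz constant 1, minimum value 0."""
    return 0.5 * sum(xi * xi for xi in x)


rng = random.Random(0)
d, ell, lam = 20, 4, 1.0
alpha = ell / (d * lam)  # midpoint of the interval (0, 2*ell/(d*lam))
x = [1.0] * d
values = [quadratic(x)]
for _ in range(200):
    x = subspace_step(quadratic, x, ell, alpha, rng)
    values.append(quadratic(x))
```

With `ell = 1` the loop uses a single random direction per step; with `ell = d` the projection spans the whole space and the step reduces to ordinary gradient descent, computed from `d` directional derivatives.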

Paper Structure

This paper contains 18 sections, 6 theorems, 52 equations, 6 figures.

Key Result

theorem 1

Assume (A0), (A1), (A2), (A3) and let $\mathbf{x}_0$ be an arbitrary initialization. Then the recursion \ref{eq: iterations} with $0<\alpha <2\ell/(d\lambda)$ results in $f(\mathbf{x}_k) \overset{a.s.}{\longrightarrow} f_*$ and $f(\mathbf{x}_k) \overset{L^1}{\longrightarrow} f_*$.
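
A short derivation sketch may help explain where the bound on $\alpha$ comes from. It rests on assumptions that are not stated in this excerpt: that the recursion has the form $\mathbf{x}_{k+1}=\mathbf{x}_k-\alpha\,\mathbf{P}_k\mathbf{P}_k^\top\nabla f(\mathbf{x}_k)$ with a random matrix $\mathbf{P}_k\in\mathbb{R}^{d\times\ell}$ scaled so that $\mathbf{P}_k^\top\mathbf{P}_k=(d/\ell)\,\mathbf{I}_\ell$, and that $\lambda$ is a Lipschitz constant of $\nabla f$. Writing $\mathbf{g}=\nabla f(\mathbf{x}_k)$ and $\mathbf{P}=\mathbf{P}_k$, the standard descent inequality for functions with $\lambda$-Lipschitz gradient gives

$$
f(\mathbf{x}_{k+1}) \;\le\; f(\mathbf{x}_k) - \alpha\,\mathbf{g}^\top\mathbf{P}\mathbf{P}^\top\mathbf{g} + \frac{\lambda\alpha^2}{2}\,\|\mathbf{P}\mathbf{P}^\top\mathbf{g}\|^2 \;=\; f(\mathbf{x}_k) - \alpha\left(1-\frac{\alpha d\lambda}{2\ell}\right)\|\mathbf{P}^\top\mathbf{g}\|^2 ,
$$

where the equality uses $\|\mathbf{P}\mathbf{P}^\top\mathbf{g}\|^2=\mathbf{g}^\top\mathbf{P}(\mathbf{P}^\top\mathbf{P})\mathbf{P}^\top\mathbf{g}=(d/\ell)\,\|\mathbf{P}^\top\mathbf{g}\|^2$. The coefficient $\alpha\left(1-\alpha d\lambda/(2\ell)\right)$ is positive exactly when $0<\alpha<2\ell/(d\lambda)$, so under these assumptions no step in that range increases $f$.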

Figures (6)

  • Figure 1: Contour plots for probability of successful embedding for various values of $\ell$, $d$, and $\epsilon$. All of the figures share the same horizontal and vertical range. Left: $\epsilon=0.01$. Center: $\epsilon=0.1$. Right: $\epsilon=0.2$.
  • Figure 2: Minimizing a function from the family \ref{eq: NesterovWorst} with $r=20,~\lambda =8$. CD represents randomized block-coordinate descent. In several of the subfigures gradient descent overlaps randomized block-coordinate descent. The shaded regions in the SSD cases represent the interval between the $10^{\text{th}}$ and $90^{\text{th}}$ percentiles of performance over $1000$ runs. The vertical axis is the relative error: $(f(\mathbf{x}_k)-f_*)/f_*$. Left: $d=100$. Center: $d=1000$. Right: $d=10000$. Top: Step-size chosen by a backtracking linesearch with Armijo conditions. Bottom: Fixed step-size.
  • Figure 3: Minimizing a function from the family \ref{eq: NesterovWorst} with $r=d,~\lambda =1$. CD represents randomized block-coordinate descent. Step-size in all cases is chosen by a backtracking linesearch with Armijo conditions. Left: $d=3,~p=50$, total parameters = 153. Center: $d=10,~p=50$, total parameters = 503. Right: $d=20,~p=100$, total parameters = 2003.
  • Figure 4: Left: 30-dimensional problem. Right: 60-dimensional problem. $M_{\ell}$ is the number of function evaluations required to attain a cut-off threshold for various values of $\ell$. For a fixed initialization, BFGS is deterministic and is represented by the vertical line. Gradient descent, not pictured, has a vertical line at $\tau=2850$ and $\tau=22828$ for $p=30$ and $p=60$, respectively. $\ell=1$ is equivalent to the method proposed in \cite{nesterov2017random} when $h=0$.
  • Figure 5: Left: Schematic of the linear elasticity problem used in the shape optimization example of Section \ref{subsect: experiments-plate}. Right: Conforming finite element mesh used to solve for maximum stress $\sigma_y$ along the $y$ direction. Only a quarter of the plate corresponding to $\theta\in[0,\pi/2]$ is modeled.
  • ...and 1 more figure

Theorems & Definitions (10)

  • theorem 1: Convergence of SSD
  • corollary 1: Convergence under strong-convexity and rate of convergence
  • theorem 2: Convergence under convexity
  • theorem 3: Non-convex convergence
  • definition 1: Successful isometric embedding
  • lemma 1: Approximately isometric embedding using Haar-distributed matrices
  • remark 1: Coordinate sampling is rarely an isometry
  • remark 2
  • remark 3
  • theorem 4: Probabilistic rate of convergence. Strongly-convex case