Direct search for stochastic optimization in random subspaces with zeroth-, first-, and second-order convergence and expected complexity

K. J. Dzahini; S. M. Wild

TL;DR

The second-order analysis of the mesh adaptive direct-search (MADS) algorithm, which relies on a second-order-like extension of the Rademacher-theorem-based definition of the Clarke subdifferential, is extended to the StoDARS framework. To the best of the authors' knowledge, this is the first such analysis in a stochastic direct-search setting.

Abstract

The work presented here is motivated by the development of StoDARS, a framework for large-scale stochastic blackbox optimization that not only is both an algorithmic and theoretical extension of the stochastic directional direct-search (SDDS) framework but also extends to noisy objectives a recent framework of direct-search algorithms in reduced spaces (DARS). Unlike SDDS, StoDARS achieves scalability by using $m$ search directions generated in random subspaces defined through the columns of Johnson--Lindenstrauss transforms (JLTs) obtained from Haar-distributed orthogonal matrices. For theoretical needs, the quality of these subspaces and the accuracy of random estimates used by the algorithm are required to hold with sufficiently large, but fixed, probabilities. By leveraging an existing supermartingale-based framework, the expected complexity of StoDARS is proved to be similar to that of SDDS and other stochastic full-space methods up to constants, when the objective function is continuously differentiable. By dropping the latter assumption, the convergence of StoDARS to Clarke stationary points with probability one is established. Moreover, the analysis of the second-order behavior of the mesh adaptive direct-search (MADS) algorithm using a second-order-like extension of the Rademacher-theorem-based definition of the Clarke subdifferential (the so-called generalized Hessian) is extended to the StoDARS framework, making it the first such analysis in a stochastic direct-search setting, to the best of our knowledge.

Paper Structure (14 sections, 25 theorems, 65 equations, 5 figures, 1 table, 2 algorithms)

Introduction
Stochastic directional direct-search algorithm and random subspace polling
Full-space unconstrained stochastic directional direct-search method
Random subspace selection and subspace polling
Random subspace direct search and its resulting stochastic process
StoDARS algorithm
Stochastic process generated by StoDARS
Zeroth-order convergence
Expected complexity analysis in the unconstrained case
General renewal-reward discrete time process and its stopping time
Expected complexity result
Convergence to Clarke stationary points
Convergence to second-order stationary points
Numerical results

Key Result

Proposition 2.1

Assume that $\underline{\bm{Q}}\in\mathbb{R}^{n\times p}$ satisfies Assumption JltAss1, and let $\mathbb{D}^p\subset \mathbb{R}^p$ be a positive spanning set (PSS). For any $\bm{v}\in \mathbb{R}^n$, [...]. In other words, with high probability, there always exists a random direction $\underline{\bm{Q}}\underline{d}_{\star}^p\in\left\lbrace\underline{\bm{Q}}\bm{d}^p:\bm{d}^p\in\mathbb{D}^p\right\rbrace$ [...]

Figures (5)

Figure 1: Illustration of a minimal positive basis $\mathbb{D}^p\subset\mathbb{S}^{p-1}$ of $\mathbb{R}^p$, with $p=2$, and the resulting set of poll directions $\mathbb{U}^n:=\left\lbrace\bm{U}_{n \times p}\bm{d}^p:\bm{d}^p\in \mathbb{D}^p\right\rbrace$, with $n=3$, where $\bm{U}_{n \times p}:=\left[\bm{U}_1\cdots\bm{U}_p\right]\in\mathbb{R}^{n\times p}$ is a matrix with orthonormal columns. As demonstrated by Proposition \ref{Prop2Point3}, $\mathbb{D}^p$ has the same positive spanning properties as $\mathbb{U}^n$ in the subspace generated by the vectors $\bm{U}_j$, $j\in\left[\!\left[1,p\right]\!\right]$, illustrated by the blue hyperplane on the right. When $\bm{v}\in\mathbb{R}^n$ is a descent direction at $\bm{x}$, say $\bm{v}=-\nabla f(\bm{x})\neq\bm{0}$ assuming $f$ differentiable, the selection of the yellow hyperplane must be avoided, since none of its poll directions makes an acute angle with $\bm{v}$ as desired by complexity theory, a situation that may also impact the efficiency of an algorithm using such poll directions. Indeed, whenever the yellow hyperplane is selected, it holds that $\bm{U}_{n \times p}^{\top}\bm{v}=\bm{0}$ even though $\bm{v}\neq\bm{0}$ as assumed above. Fortunately, such situations are avoided with high probability in Algorithm \ref{algoStoDARS}, since the random matrix $\underline{\bm{Q}}:=\sqrt{\frac{n}{p}}\,\underline{\bm{U}}_{n \times p}$, when obtained from the Haar distribution, satisfies ${\left\lVert\underline{\bm{Q}}^{\top}\bm{v}\right\rVert}_{2}\geq \alpha_{Q}{\left\lVert\bm{v}\right\rVert}_{2}$ for some $\alpha_{Q}\in (0,1)$ with high probability thanks to Theorem \ref{JLforHaarTheor}. Moreover, Lemma \ref{lemKappaD3} ensures that the illustrated behaviour of the blue hyperplane on the right (and its poll directions) with respect to $\bm{v}$ occurs with high probability, that is, with high probability there exists a random poll direction that makes an acute angle with $\bm{v}$.
Figure 2: Data profiles for convergence tolerances $\tau=10^{-2}$ and $\tau=10^{-3}$, on $1{,}600$ problem instances for additive noise with standard deviation $\sigma=10^{-3}$, while using $n_k=4$ noisy function evaluations with reuse of available samples from previous iterations for the computation of estimates.
Figure 3: Data profiles for convergence tolerances $\tau=10^{-2}$ and $\tau=10^{-3}$, on $1{,}600$ problem instances for multiplicative noise with standard deviation $\sigma=10^{-3}$, while using $n_k=4$ noisy function evaluations with reuse of available samples from previous iterations for the computation of estimates.
Figure 4: Data profiles for convergence tolerances $\tau=10^{-2}$ and $\tau=10^{-3}$, on $1{,}600$ problem instances for additive noise with standard deviation $\sigma=10^{-3}$, while using $n_k=25$ noisy function evaluations with reuse of available samples from previous iterations for the computation of estimates.
Figure 5: Data profiles for convergence tolerances $\tau=10^{-2}$ and $\tau=10^{-3}$, on $1{,}600$ problem instances for multiplicative noise with standard deviation $\sigma=10^{-3}$, while using $n_k=25$ noisy function evaluations with reuse of available samples from previous iterations for the computation of estimates.
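
The subspace construction described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the helper names (`haar_orthonormal_columns`, `minimal_positive_basis`, `poll_directions`), the dimensions $n=50$, $p=5$, and the particular minimal positive basis are assumptions chosen for illustration. The sketch draws $p$ orthonormal columns by Gram-Schmidt on Gaussian vectors (which yields the first $p$ columns of a Haar-distributed orthogonal matrix), scales them by $\sqrt{n/p}$ to form $Q$, maps a minimal positive basis of $\mathbb{R}^p$ to poll directions $Qd$, and checks that at least one poll direction makes an acute angle with a given $v$.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def haar_orthonormal_columns(n, p, rng):
    """Return p orthonormal vectors in R^n (hypothetical helper).

    Gram-Schmidt on i.i.d. standard Gaussian vectors gives columns
    distributed as the first p columns of a Haar orthogonal matrix.
    """
    cols = []
    for _ in range(p):
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(2):  # two passes for numerical stability
            for u in cols:
                c = dot(u, w)
                w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        cols.append([wi / norm for wi in w])
    return cols


def minimal_positive_basis(p):
    """e_1, ..., e_p and -(e_1 + ... + e_p)/sqrt(p): p + 1 unit vectors."""
    basis = [[1.0 if i == j else 0.0 for i in range(p)] for j in range(p)]
    basis.append([-1.0 / math.sqrt(p)] * p)
    return basis


def poll_directions(cols, n, dirs_p):
    """Map each d in R^p to Q d in R^n, where Q = sqrt(n/p) * U."""
    p = len(cols)
    scale = math.sqrt(n / p)
    return [
        [scale * sum(cols[j][i] * d[j] for j in range(p)) for i in range(n)]
        for d in dirs_p
    ]


rng = random.Random(0)   # fixed seed, assumption for reproducibility
n, p = 50, 5             # illustrative dimensions, not from the paper
U = haar_orthonormal_columns(n, p, rng)
polls = poll_directions(U, n, minimal_positive_basis(p))

# A stand-in for a descent direction such as -grad f(x).
v = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Ratio ||Q^T v|| / ||v||, the quantity bounded below by alpha_Q.
scale = math.sqrt(n / p)
QTv = [scale * dot(U[j], v) for j in range(p)]
ratio = math.sqrt(dot(QTv, QTv)) / math.sqrt(dot(v, v))

# Largest inner product between a poll direction and v.
best = max(dot(d, v) for d in polls)
print(f"||Q^T v|| / ||v|| = {ratio:.3f}, max (Qd)^T v = {best:.3f}")
```

Because $(Qd)^{\top}v = d^{\top}(Q^{\top}v)$ and the chosen directions positively span $\mathbb{R}^p$, `best` is positive whenever $Q^{\top}v\neq 0$, which is the situation the yellow hyperplane in Figure 1 fails to satisfy.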

Theorems & Definitions (56)

Definition 2.1
Proposition 2.1
proof
Remark 2.1
Lemma 2.1
Theorem 2.1
Lemma 2.2
Theorem 2.2
proof
Lemma 2.3
...and 46 more
