Stochastic Average Model Methods

Matt Menickelly; Stefan M. Wild

TL;DR

This work considers the solution of finite-sum minimization problems, such as those appearing in nonlinear least-squares or general empirical risk minimization problems, and presents the idea of stochastic average model (SAM) methods, inspired by stochastic average gradient methods.

Abstract

We consider the solution of finite-sum minimization problems, such as those appearing in nonlinear least-squares or general empirical risk minimization problems. We are motivated by problems in which the summand functions are computationally expensive and evaluating all summands on every iteration of an optimization method may be undesirable. We present the idea of stochastic average model (SAM) methods, inspired by stochastic average gradient methods. SAM methods sample component functions on each iteration of a trust-region method according to a discrete probability distribution on component functions; the distribution is designed to minimize an upper bound on the variance of the resulting stochastic model. We present promising numerical results concerning an implemented variant extending the derivative-free model-based trust-region solver POUNDERS, which we name SAM-POUNDERS.
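The idea of choosing a sampling distribution to keep variance small can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the rule used by SAM-POUNDERS: each component $i$ is assumed to be included independently with probability $\pi_i$, the hypothetical array `d` holds nonnegative per-component magnitudes, and the variance proxy is $\sum_i d_i^2 (1-\pi_i)/\pi_i$. Under those assumptions, the probabilities that minimize the proxy for an expected batch size `batch_size` are proportional to `d`, capped at 1.

```python
import numpy as np

def inclusion_probabilities(d, batch_size):
    """Probabilities proportional to |d_i|, capped at 1, summing to batch_size.

    Assumes every d_i is nonzero and 0 < batch_size < len(d).
    """
    d = np.abs(np.asarray(d, dtype=float))
    capped = np.zeros(d.size, dtype=bool)
    while True:
        # Budget left over after the capped components take probability 1.
        remaining = batch_size - capped.sum()
        pi = np.where(capped, 1.0, remaining * d / d[~capped].sum())
        newly_capped = (pi > 1.0) & ~capped
        if not newly_capped.any():
            return pi
        capped |= newly_capped

def variance_proxy(d, pi):
    """Variance of sum_{i in I} d_i / pi_i under independent inclusion."""
    d = np.asarray(d, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(d**2 * (1.0 - pi) / pi))

d = np.array([10.0, 1.0, 1.0, 1.0, 1.0])   # hypothetical magnitudes
pi_weighted = inclusion_probabilities(d, batch_size=2)
pi_uniform = np.full(d.size, 2 / d.size)
```

In this toy instance the dominant component is always sampled, and the weighted probabilities give a much smaller variance proxy than uniform probabilities with the same expected batch size.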

Paper Structure (28 sections, 5 theorems, 74 equations, 16 figures, 4 algorithms)

Introduction
Stochastic Average Model Methods
Ameliorated Models $\hat{m}_{I^k}$ and $\hat{m}_{J^k}$
Variance of $\hat{m}_{I^k}$
A proposed method for choosing probabilities $\pi_i^k$ given a fixed batch size
Additional models beyond \eqref{eq:first_order}
Linear interpolation models
Gauss--Newton models \eqref{eq:gauss_newton}
Zeroth-order Gauss--Newton models \eqref{eq:zero_gauss_newton}
Convergence guarantees
Numerical Experiments
Test problems
Logistic loss function
Generalized Rosenbrock functions
Cube functions
...and 13 more sections

Key Result

Proposition 1

For all samplings defined by $\pi_i^k>0$, $i=1, \ldots,p$, and for all $\bm{x} \in \mathbb{R}^n$, the ameliorated model in \eqref{eq:saga_model} satisfies $\mathbb{E}_{I^k}\left[ \hat{m}_{I^k}(\bm{x})\right] = m^k(\bm{x}).$
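The unbiasedness claim can be checked numerically on a toy instance. The sketch below rests on assumptions not stated on this page: the full model is taken to be a sum of component models, $m^k(\bm{x}) = \sum_i m_i^k(\bm{x})$; the ameliorated model is assumed to have the SAGA-like form $\sum_i m_i^{\mathrm{old}}(\bm{x}) + \sum_{i \in I^k} \bigl(m_i^k(\bm{x}) - m_i^{\mathrm{old}}(\bm{x})\bigr)/\pi_i^k$; and each index is assumed to enter $I^k$ independently with probability $\pi_i^k$. The names `new`, `old`, and `pi` are hypothetical stand-ins for the component model values at one fixed $\bm{x}$.

```python
import itertools
import numpy as np

def ameliorated_value(new, old, pi, subset):
    """Stale sum plus importance-weighted corrections on the sampled subset."""
    correction = sum((new[i] - old[i]) / pi[i] for i in subset)
    return old.sum() + correction

def exact_expectation(new, old, pi):
    """Exact expectation over all 2^p subsets under independent inclusion."""
    p = len(pi)
    total = 0.0
    for mask in itertools.product([0, 1], repeat=p):
        # Probability of drawing exactly this subset.
        prob = np.prod([pi[i] if mask[i] else 1.0 - pi[i] for i in range(p)])
        subset = [i for i in range(p) if mask[i]]
        total += prob * ameliorated_value(new, old, pi, subset)
    return total

new = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical current values m_i^k(x)
old = np.array([0.3, 1.0, -0.7, 2.0])   # hypothetical stale values
pi = np.array([0.9, 0.2, 0.5, 0.7])     # any probabilities in (0, 1]

expected = exact_expectation(new, old, pi)
full_model = new.sum()
```

Because each correction term is divided by its own inclusion probability, the expectation of the corrections equals $\sum_i (m_i^k - m_i^{\mathrm{old}})$, so `expected` agrees with `full_model` up to rounding for any positive probabilities.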

Figures (16)

Figure 4.1: Statistics of a single run of Algorithm \ref{alg:dfotr} with first-order models \eqref{eq:first_order} for each of the three different modes of problem data generation for logistic loss functions. In each of the three pairs of figures, the left figure overlays the optimality gap $f(x^k)-f(x^*)$ on the sparsity pattern of the evaluations $(F_i(x^k),\nabla F_i(x^k))$ performed at the $k$th point queried by the algorithm. The histogram in the right figure of each pair shows the row sums of the corresponding sparsity pattern, namely, the total number of $(F_i(x),\nabla F_i(x))$ evaluations performed.
Figure 4.2: Statistics of a single run of Algorithm \ref{alg:dfotr} using POUNDERS routines for model building for each of the three different modes of problem data generation for the generalized Rosenbrock function. The interpretation of the plots is the same as in Figure 4.1 except that we now perform only function evaluations (as opposed to gradient evaluations) at a queried point $x^k$.
Figure 4.3: Comparing SAG-LS (Lipschitz) with SAM-FO with dynamic batch sizes on logistic loss problems with (left) balanced data generation, (center) progressive data generation, and (right) imbalanced data generation. Solid lines and markers denote median performance across the 90 problems (30 random datasets $\times$ 3 random seeds per dataset), while the outer bands denote 25th to 75th percentile performance. We note that on the $x$-axis, $f(\bm{x}^k)-f(\bm{x}^*)$ is an appropriate metric because these logistic loss test problems are strongly convex.
Figure 4.4: Comparing the performance of SAM-FO with itself when using uniform generation of batches of a fixed resource size $r$ versus generating batches according to Algorithm \ref{alg:dynamic_batchsize} with parameter $r$. We show results using the same percentile bands as in Figure 4.3 and separate results by the mode of generating the dataset (balanced, progressive, or imbalanced Lipschitz constants).
Figure 4.5: For each mode of generating random logistic loss test problems, we show the median, over the problems $\pi$, of $\log_2(R_{r,\pi,\tau,\mu})$. The top row displays results for a convergence tolerance $\tau=10^{-3}$, and the bottom row displays results for the tighter convergence tolerance $\tau=10^{-7}$.
...and 11 more figures

Theorems & Definitions (12)

Proposition 1
proof
Proposition 2
proof
Proposition 3
Theorem 1
Definition 1.1
Definition 1.2
Definition 1.3
Definition 1.4
...and 2 more
