Convergence of a class of gradient-free optimisation schemes when the objective function is noisy, irregular, or both

Christophe Andrieu, Nicolas Chopin, Ettore Fincato, Mathieu Gerber

TL;DR

This work develops a unified convergence theory for gradient-free optimization when the objective is noisy, irregular, or both, by embedding the problem into a time-inhomogeneous gradient descent on a smooth approximation $\mathcal{L}_\gamma$. It shows that with appropriate decay of the smoothing parameter $\gamma_n$ and step size $\beta_n$, the gradient norm $\|\nabla L_{\gamma_n}(\theta_n)\|$ tends to zero almost surely in the stochastic setting (and under milder conditions in the deterministic setting), and that epi-convergence links the smoothed problems to the original objective as $\gamma_n\to0$. The framework encompasses mollification and model-based search, connects to the extensive literature on Gaussian smoothing and Bayes-Laplace interpretations, and is illustrated with experiments on a discontinuous AUC-risk minimization task. The results give guidance on how to balance smoothing and step sizes so that convergence holds even when the objective is non-smooth or only accessible via noisy evaluations. Extensions to non-Gaussian kernels and state-dependent noise are indicated as forthcoming.

Abstract

We investigate the convergence properties of a class of iterative algorithms designed to minimize a potentially non-smooth and noisy objective function, which may be algebraically intractable and whose values may be obtained as the output of a black box. The algorithms considered can be cast under the umbrella of a generalised gradient descent recursion, where the gradient is that of a smooth approximation of the objective function. The framework we develop includes as special cases model-based and mollification methods, two classical approaches to zero-th order optimisation. The convergence results are obtained under very weak assumptions on the regularity of the objective function and involve a trade-off between the degree of smoothing and size of the steps taken in the parameter updates. As expected, additional assumptions are required in the stochastic case. We illustrate the relevance of these algorithms and our convergence results through a challenging classification example from machine learning.
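The recursion described in the abstract can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes Gaussian smoothing, $L_\gamma(\theta)=\mathbb{E}[L(\theta+\gamma Z)]$ with $Z\sim\mathcal{N}(0,1)$, and estimates its gradient by an antithetic Monte Carlo average of noisy black-box evaluations. The toy objective, the function names, the sample size and the constants `c_beta`, `c_gamma`, `iota`, `kappa` are all hypothetical choices made for illustration. They follow the polynomial form $\beta_n=c_\beta n^{-\iota}$, $\gamma_n=c_\gamma n^{-\kappa}$ but have not been checked against the conditions of the paper's theorems.

```python
import random


def noisy_objective(theta, rng, noise_sd=0.1):
    """Hypothetical black box: a discontinuous objective observed with noise.

    The noise-free part is (theta - 1)^2 plus a jump of 0.5 for theta < 0,
    so it is minimised at theta = 1 and is not differentiable at theta = 0.
    """
    value = (theta - 1.0) ** 2 + (0.5 if theta < 0.0 else 0.0)
    return value + rng.gauss(0.0, noise_sd)


def smoothed_gradient_estimate(f, theta, gamma, rng, n_samples=20):
    """Monte Carlo estimate of the gradient of L_gamma at theta.

    Uses the antithetic form of the Gaussian-smoothing identity:
    grad L_gamma(theta) = E[(L(theta + gamma Z) - L(theta - gamma Z)) Z] / (2 gamma).
    Only evaluations of f are needed, never its derivative.
    """
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        diff = f(theta + gamma * z, rng) - f(theta - gamma * z, rng)
        total += diff * z / (2.0 * gamma)
    return total / n_samples


def run_smoothed_descent(theta0, n_iters, c_beta=0.5, c_gamma=1.0,
                         iota=0.8, kappa=0.2, seed=0):
    """Gradient-free descent with decaying step size and smoothing level."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iters + 1):
        beta_n = c_beta * n ** (-iota)     # step size
        gamma_n = c_gamma * n ** (-kappa)  # smoothing parameter
        grad = smoothed_gradient_estimate(noisy_objective, theta, gamma_n, rng)
        theta -= beta_n * grad
    return theta


if __name__ == "__main__":
    print(run_smoothed_descent(theta0=-2.0, n_iters=2000))
```

In this sketch the smoothing level decays more slowly than the step size (`kappa < iota`). This reflects the trade-off mentioned in the abstract: dividing by a small $\gamma_n$ inflates the noise in the gradient estimate, so the steps have to shrink fast enough to compensate.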

Paper Structure

This paper contains 31 sections, 25 theorems, 97 equations, and 1 figure.

Key Result

Theorem 1

Assume that Assumptions Gen1 and Gen2 hold and let $(\theta_n)_{n\geq 1}$ be as defined in the recursion labelled eq:theta_seq, where $\beta_n=c_\beta n^{-\iota}$ and $\gamma_n=c_\gamma n^{-\kappa}$ for all $n\geq 1$ and for some constants $(c_\beta,c_\gamma)\in (0,\infty)^2$ and $(\iota,\kappa)\in(0,1]^2$. Let $\alpha\in[0

Figures (1)

  • Figure 1: AUC score (test data) of iterate vs iteration (10 independent runs). The right panel is a zoomed-in section of the left panel, where the first 100 iterations are discarded. The baseline (dashed line) is the AUC score of the logistic regression estimate.

Theorems & Definitions (50)

  • Theorem 1
  • Theorem 2
  • Remark 1
  • Theorem 3
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Remark 2
  • Proposition 4
  • Remark 3
  • ...and 40 more