Convergence of a class of gradient-free optimisation schemes when the objective function is noisy, irregular, or both
Christophe Andrieu, Nicolas Chopin, Ettore Fincato, Mathieu Gerber
TL;DR
This work develops a unified convergence theory for gradient-free optimisation when the objective is noisy, irregular, or both, by recasting the algorithms as a time-inhomogeneous gradient descent on a smooth approximation $\mathcal{L}_\gamma$ of the objective. It shows that, when the smoothing parameter $\gamma_n$ and the step size $\beta_n$ decay at suitably balanced rates, the gradient norm $\|\nabla \mathcal{L}_{\gamma_n}(\theta_n)\|$ tends to zero almost surely in the stochastic setting (and under milder conditions in the deterministic setting), and that epi-convergence links the smoothed problems to the original objective as $\gamma_n\to0$. The framework encompasses mollification and model-based search, connects to the literature on Gaussian smoothing and Bayes-Laplace interpretations, and is illustrated with experiments on a discontinuous AUC-risk minimisation task. The results give guidance on how to balance smoothing against step sizes so that convergence holds even when the objective is non-smooth or accessible only through noisy evaluations. Non-Gaussian kernels and state-dependent noise are identified as potential extensions.
Abstract
We investigate the convergence properties of a class of iterative algorithms designed to minimize a potentially non-smooth and noisy objective function, which may be algebraically intractable and whose values may be obtained as the output of a black box. The algorithms considered can be cast under the umbrella of a generalised gradient descent recursion, where the gradient is that of a smooth approximation of the objective function. The framework we develop includes as special cases model-based and mollification methods, two classical approaches to zeroth-order optimisation. The convergence results are obtained under very weak assumptions on the regularity of the objective function and involve a trade-off between the degree of smoothing and the size of the steps taken in the parameter updates. As expected, additional assumptions are required in the stochastic case. We illustrate the relevance of these algorithms and our convergence results through a challenging classification example from machine learning.
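To make the recursion concrete, the following is a minimal Python sketch of one scheme of this type: Gaussian mollification with a Monte Carlo gradient estimate built only from function values. It is an illustration under stated assumptions, not the authors' implementation. The function names, the test objective, the sample size, and the schedules $\gamma_n = n^{-1/4}$ and $\beta_n = 0.5\,n^{-3/4}$ are all hypothetical choices made for this example and are not claimed to match the conditions of the convergence results.

```python
import numpy as np


def smoothed_gradient_estimate(f, theta, gamma, rng, n_samples=32):
    """Monte Carlo estimate of the gradient of the Gaussian-smoothed objective
    L_gamma(theta) = E[f(theta + gamma * Z)], Z ~ N(0, I), from values of f only."""
    z = rng.standard_normal((n_samples, theta.size))
    values = np.array([f(theta + gamma * zi) for zi in z])
    # Subtracting f(theta) leaves the expectation unchanged (E[Z] = 0)
    # but reduces the variance of the estimate.
    baseline = f(theta)
    return ((values - baseline)[:, None] * z).mean(axis=0) / gamma


def gradient_free_descent(f, theta0, n_iter=1000, seed=0):
    """Descent on a sequence of smoothed objectives with decaying smoothing."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for n in range(1, n_iter + 1):
        gamma_n = n ** -0.25        # smoothing parameter (assumed schedule)
        beta_n = 0.5 * n ** -0.75   # step size (assumed schedule)
        grad = smoothed_gradient_estimate(f, theta, gamma_n, rng)
        theta = theta - beta_n * grad
    return theta


def make_noisy_step_objective(seed=1, noise_sd=0.1):
    """Hypothetical black box: a piecewise-constant bowl observed with noise.
    Its gradient is zero almost everywhere, so plain gradient descent is useless."""
    noise_rng = np.random.default_rng(seed)

    def f(theta):
        quantised = np.round(2.0 * theta) / 2.0  # round to the nearest 0.5
        return float(np.sum(quantised ** 2) + noise_sd * noise_rng.standard_normal())

    return f


if __name__ == "__main__":
    f = make_noisy_step_objective()
    print(gradient_free_descent(f, [3.0, -2.0]))
```

In this sketch the smoothing shrinks more slowly than the step size, which reflects the trade-off described above: a smaller $\gamma_n$ brings the smoothed problem closer to the original one but inflates the variance of the gradient estimate through the $1/\gamma_n$ factor.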
