Zeroth-Order Methods for Stochastic Nonconvex Nonsmooth Composite Optimization
Ziyi Chen, Peiran Yu, Heng Huang
TL;DR
This paper studies stochastic nonconvex nonsmooth composite optimization of the form $\phi(x)=F(x)+h(x)$, with $F(x)=\mathbb{E}_{\xi}[f_{\xi}(x)]$ and $h$ convex, and makes no smoothness assumption on $F$, a setting that arises in key machine learning problems. It introduces two notions of approximate stationarity based on the Goldstein $\delta$-subdifferential, the proximal Goldstein stationary point (PGSP) and the conditional gradient Goldstein stationary point (CGGSP), and develops two zeroth-order methods: 0-PGD (proximal gradient) and 0-GCG (generalized conditional gradient). The paper provides convergence rates and function-evaluation complexities for both algorithms under minibatch and variance-reduced gradient estimators, showing that they reach the proposed stationary points in finite time. The theoretical results are complemented by experiments on synthetic regularized ReLU networks and on ResNet-20 with CIFAR-10, which illustrate practical effectiveness and highlight the trade-offs between proximal steps and linear minimization oracles (LMOs) in different settings. Overall, the work broadens finite-time guarantees for nonconvex nonsmooth stochastic optimization by combining Goldstein-based stationarity with zeroth-order estimation.
Abstract
This work aims to solve a stochastic nonconvex nonsmooth composite optimization problem. Previous work on composite optimization requires the major part to satisfy Lipschitz smoothness or a relaxed smoothness condition, which excludes machine learning examples such as regularized ReLU networks and sparse support matrix machines. In this work, we focus on stochastic nonconvex composite optimization without any smoothness assumptions. In particular, we propose two new notions of approximate stationary points for such problems and obtain finite-time convergence results for two zeroth-order algorithms, one for each notion. Finally, we demonstrate the effectiveness of these algorithms in numerical experiments.
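To make the idea of a zeroth-order proximal step concrete, the following is a minimal sketch, not the paper's 0-PGD algorithm. It assumes a standard two-point randomized-smoothing gradient estimator with radius `delta` and takes $h(x)=\lambda\|x\|_1$ as an example convex term, whose proximal map is soft-thresholding. All function names, parameters, and the toy objective are hypothetical choices made for illustration.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]


def zo_gradient(f, x, delta, batch, rng):
    """Two-point zeroth-order gradient estimate (assumed estimator).

    Averages d * (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u over
    `batch` random unit directions u, using function values only.
    """
    d = len(x)
    est = [0.0] * d
    for _ in range(batch):
        u = sample_unit_sphere(d, rng)
        x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
        x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
        scale = d * (f(x_plus) - f(x_minus)) / (2.0 * delta * batch)
        for i in range(d):
            est[i] += scale * u[i]
    return est


def soft_threshold(v, t):
    """Proximal map of t * ||.||_1, applied coordinate-wise."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]


def zo_prox_step(f, x, step, lam, delta, batch, rng):
    """One illustrative step: estimated gradient step, then prox of step*lam*||.||_1."""
    g = zo_gradient(f, x, delta, batch, rng)
    return soft_threshold([xi - step * gi for xi, gi in zip(x, g)], step * lam)


if __name__ == "__main__":
    # Toy nonsmooth, nonconvex objective (hypothetical): a tiny ReLU-style loss.
    def f(x):
        return abs(max(x[0], 0.0) - 1.0) + abs(max(x[1], 0.0) * x[0] - 0.5)

    rng = random.Random(0)
    x = [2.0, 2.0]
    for _ in range(200):
        x = zo_prox_step(f, x, step=0.05, lam=0.01, delta=0.05, batch=8, rng=rng)
    print(x, f(x) + 0.01 * sum(abs(v) for v in x))
```

The estimator only queries function values of `f`, which is the setting the abstract describes. Replacing the soft-thresholding step with a call to a linear minimization oracle over a constraint set would give the conditional-gradient flavor instead; that variant is not shown here.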
