Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence
Ilyas Fatkhullin, Niao He
TL;DR
The paper advances nonconvex stochastic optimization by analyzing Stochastic Mirror Descent (SMD) with general distance generating functions (DGFs), including nonsmooth ones, and by measuring convergence via the Bregman Forward-Backward Envelope (BFBE). It develops a Lyapunov-based framework under $(\ell,\omega)$-relative smoothness and bounded gradient noise, yielding convergence in expectation, high-probability guarantees, and global convergence under a generalized Proximal PL condition, without requiring large minibatches or Euclidean smoothness. The theory is instantiated in machine learning settings such as differential privacy (DP), reinforcement learning (RL), and training linear neural networks, where nonsmooth DGFs like Shannon entropy give improved or dimension-robust guarantees. These results broaden the scope of SMD, allowing more flexible geometry and potentially stronger guarantees in nonconvex stochastic learning tasks. The practical impact includes simpler DP algorithms with nearly dimension-free utility bounds, RL methods with reduced dependence on action-space size, and provably convergent non-Euclidean SMD schemes for training linear neural networks.
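For reference, the standard objects behind these terms can be written as follows. This is a sketch in assumed notation (feasible set $X$, step size $\eta_t$, stochastic gradient $g_t$), ignoring any composite or regularization term; the paper's exact definitions may differ. The Bregman divergence induced by a DGF $\omega$, and the SMD update, are

$$D_\omega(x, y) = \omega(x) - \omega(y) - \langle \nabla \omega(y), x - y \rangle,$$

$$x_{t+1} = \arg\min_{x \in X} \left\{ \eta_t \langle g_t, x \rangle + D_\omega(x, x_t) \right\}.$$

One common form of relative smoothness of $f$ with respect to $\omega$, with constant $\ell$, is

$$\left| f(x) - f(y) - \langle \nabla f(y), x - y \rangle \right| \le \ell \, D_\omega(x, y).$$

For the negative Shannon entropy $\omega(x) = \sum_i x_i \log x_i$ on the probability simplex, $D_\omega$ is the Kullback-Leibler divergence, and $\nabla \omega$ is not Lipschitz continuous near the boundary of the simplex, which is why this DGF falls outside analyses that require a smooth DGF.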
Abstract
This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD restrict the choice of the distance generating function (DGF) to be differentiable with Lipschitz continuous gradients, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD that supports general DGFs, overcomes the above limitations, and relies solely on standard assumptions. Moreover, our convergence is established with respect to the Bregman Forward-Backward Envelope, which is a stronger measure than the commonly used squared norm of the gradient mapping. We further extend our results to guarantee high-probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-Łojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms.
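To make the Shannon entropy setup mentioned above concrete, here is a minimal sketch of batch-free SMD on the probability simplex with the negative entropy DGF, where the mirror step reduces to the well-known multiplicative (exponentiated gradient) update. This is not the paper's algorithm or experimental setup: the toy objective, the matrix `A`, the Gaussian noise model, and all function names and default parameters are hypothetical choices made for illustration.

```python
import math
import random


def kl_divergence(x, y):
    """Bregman divergence of the negative Shannon entropy on the simplex (KL)."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0.0)


def entropic_smd_step(x, grad, step_size):
    """One mirror step: new x[i] is proportional to x[i] * exp(-step_size * grad[i])."""
    # Work in log space and subtract the maximum for numerical stability.
    logits = [math.log(xi) - step_size * gi for xi, gi in zip(x, grad)]
    shift = max(logits)
    weights = [math.exp(v - shift) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def toy_objective(x, A):
    """Hypothetical nonconvex objective f(x) = 0.5 * x^T A x (A symmetric, indefinite)."""
    n = len(x)
    return 0.5 * sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def stochastic_gradient(x, A, noise_std, rng):
    """Exact gradient A x plus zero-mean Gaussian noise (assumed noise model)."""
    n = len(x)
    return [
        sum(A[i][j] * x[j] for j in range(n)) + rng.gauss(0.0, noise_std)
        for i in range(n)
    ]


def run_smd(A, x0, step_size=0.1, num_steps=200, noise_std=0.1, seed=0):
    """Run SMD using a single stochastic gradient per step (no minibatching)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(num_steps):
        g = stochastic_gradient(x, A, noise_std, rng)
        x = entropic_smd_step(x, g, step_size)
    return x


if __name__ == "__main__":
    # Illustrative indefinite matrix; not taken from the paper.
    A = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 0.5]]
    x0 = [1.0 / 3.0] * 3
    x_final = run_smd(A, x0)
    print("final iterate:", x_final)
    print("objective:", toy_objective(x_final, A))
    print("KL(x_final, x0):", kl_divergence(x_final, x0))
```

Because the update is multiplicative and then normalized, every iterate stays strictly inside the simplex without an explicit projection, which is the practical appeal of the entropy DGF over a Euclidean projection step.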
