Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence

Ilyas Fatkhullin; Niao He

TL;DR

The paper advances nonconvex stochastic optimization by showing that Stochastic Mirror Descent (SMD) converges with general distance generating functions (DGFs), including nonsmooth ones, and by measuring convergence through the Bregman Forward-Backward Envelope (BFBE). It develops a Lyapunov-based analysis under $(\ell,\omega)$-relative smoothness and bounded gradient noise, yielding convergence in expectation, high-probability guarantees, and global convergence under a generalized Proximal PŁ condition, without requiring large minibatches or Euclidean smoothness. The theory is instantiated in differential privacy (DP), reinforcement learning (RL), and the training of linear neural networks, where nonsmooth DGFs such as Shannon entropy give improved or dimension-robust guarantees. These results broaden the geometries available to SMD in nonconvex stochastic learning. In practice, they yield simpler DP algorithms with nearly dimension-independent utility bounds, RL methods with reduced dependence on action-space size, and provably convergent non-Euclidean SMD schemes for linear neural networks.

Abstract

This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD restrict the choice of the distance generating function (DGF) to be differentiable with Lipschitz continuous gradients, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD supporting general DGF, that overcomes the above limitations and relies solely on the standard assumptions. Moreover, our convergence is established with respect to the Bregman Forward-Backward envelope, which is a stronger measure than the commonly used squared norm of gradient mapping. We further extend our results to guarantee high probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-Łojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms.
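As a rough illustration of the setting the abstract describes, the sketch below runs batch-free SMD on the probability simplex with the negative Shannon entropy as the DGF, for which the mirror step reduces to a multiplicative (exponentiated-gradient) update. The objective, step size, noise level, and the function names `entropic_smd_step` and `run_smd` are illustrative assumptions, not the paper's algorithm or experimental setup.

```python
import numpy as np

def entropic_smd_step(x, stoch_grad, step_size):
    """One mirror descent step on the simplex with DGF
    omega(x) = sum_i x_i * log(x_i) (negative Shannon entropy).
    Assumes every entry of x is strictly positive."""
    # The closed-form update is x_i * exp(-step_size * g_i), renormalized.
    # Compute it in log-space and shift by the max for numerical stability.
    logits = np.log(x) - step_size * stoch_grad
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def run_smd(grad_fn, x0, step_size, num_iters, noise_std, seed=0):
    """Run SMD with one noisy gradient per iteration (no minibatching)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        # Hypothetical noise model: exact gradient plus Gaussian noise.
        g = grad_fn(x) + noise_std * rng.standard_normal(x.shape)
        x = entropic_smd_step(x, g, step_size)
    return x

# Hypothetical nonconvex objective F(x) = 0.5 * x^T A x with indefinite A.
A = np.diag([1.0, -1.0, 0.5])
grad_fn = lambda x: A @ x

x_final = run_smd(grad_fn, np.full(3, 1.0 / 3.0),
                  step_size=0.1, num_iters=200, noise_std=0.01)
```

The iterates stay strictly inside the simplex without an explicit projection, which is the practical appeal of the entropic geometry; the entropy DGF does not have a Lipschitz continuous gradient there, which is the case the abstract says earlier batch-free analyses excluded.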

Paper Structure (25 sections, 22 theorems, 113 equations, 1 figure)

INTRODUCTION
Related Work
Contributions
Our Techniques
PRELIMINARIES
FOSP Measures
ASSUMPTIONS
MAIN RESULTS
Connections between FOSP Measures
Convergence to FOSP in Expectation
High Probability Convergence to FOSP under Sub-Gaussian Noise
Global Convergence under Generalized Proximal PŁ condition
NEW INSIGHTS FOR MACHINE LEARNING
DP Learning in $\ell_2$ and $\ell_1$ Settings
Policy Optimization in Reinforcement Learning (RL)
...and 10 more sections

Key Result

Lemma 4.1

Let $F(\cdot)$ be $(\ell, \omega)$-smooth and let $\sqrt{D_{\omega}^{\text{sym}}(x, y)}$ be a metric. Then for any $x \in \mathcal{X} \cap \mathcal{S}$ and any $\rho, s > 0$ such that $\rho > \ell / s + 2 \ell$, the BPM and BGM stationarity measures are comparable (BPM $\approx$ BGM) up to the constant $C(\ell, \rho, s) := \frac{(1+s)(\rho - \ell) + (1 + s^{-1}) \ell }{\rho - \ell - (1+s^{-1}) \ell }$. In particular, for $s = 1$ and $\rho = 4 \ell$, we have $C(\ell, \rho, s) = 8$.
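The special case of the constant in Lemma 4.1 can be checked by direct substitution. With $s = 1$ and $\rho = 4\ell$, the condition $\rho > \ell/s + 2\ell = 3\ell$ is satisfied, and

$$
C(\ell, 4\ell, 1) = \frac{(1+1)(4\ell - \ell) + (1 + 1)\,\ell}{4\ell - \ell - (1+1)\,\ell} = \frac{6\ell + 2\ell}{3\ell - 2\ell} = \frac{8\ell}{\ell} = 8 .
$$

The requirement $\rho > \ell/s + 2\ell$ is exactly what keeps the denominator $\rho - \ell - (1 + s^{-1})\ell$ positive.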

Figures (1)

Figure 1: Sensitivity to step-size choice for SMDr1, SMDr2, SGD and Clip SGD (with clipping radius $1$). The plot shows the function value $F(x_T)$ after $T=10^4$ iterations for each step-size. The star markers correspond to the actual runs, and the lines linearly interpolate between them.

Theorems & Definitions (39)

Lemma 4.1: BPM $\approx$ BGM
Lemma 4.2: BFBE $>$ BGM
Theorem 4.3
Proof sketch
Theorem 4.5
Theorem 4.7
Definition 5.1: $(\epsilon, \delta)$-DP (Dwork et al., 2006)
Lemma 5.2: Theorem 1 in Abadi et al. (2016)
Corollary 5.3
Corollary 5.4
...and 29 more
