Stochastic Weakly Convex Optimization Beyond Lipschitz Continuity

Wenzhi Gao, Qi Deng

TL;DR

Based on new adaptive regularization strategies, it is shown that a wide class of stochastic algorithms, including the stochastic subgradient method, preserve the $\mathcal{O} ( 1 / \sqrt{K})$ convergence rate with constant failure rate.

Abstract

This paper considers stochastic weakly convex optimization without the standard Lipschitz continuity assumption. Based on new adaptive regularization (stepsize) strategies, we show that a wide class of stochastic algorithms, including the stochastic subgradient method, preserve the $\mathcal{O} ( 1 / \sqrt{K})$ convergence rate with constant failure rate. Our analyses rest on rather weak assumptions: the Lipschitz parameter can be either bounded by a general growth function of $\|x\|$ or locally estimated through independent random samples.
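
To make the idea of an adaptive regularization (stepsize) concrete, here is a minimal sketch of a stochastic subgradient method on the one-dimensional function $f(x) = |e^x + e^{-x} - 3|$ from Figure 1. The regularization rule $\gamma_k = \sqrt{K}\, G(|x^k|) / \alpha$ with growth function $G(t) = e^t$, the multiplicative noise model, and the names `growth`, `base_step`, and `adaptive_subgradient` are all illustrative assumptions, not the paper's exact scheme.

```python
import math
import random


def f(x):
    """Objective from Figure 1: f(x) = |e^x + e^{-x} - 3|."""
    return abs(math.exp(x) + math.exp(-x) - 3.0)


def subgrad(x):
    """One subgradient of f at x (sign of the inner term times its derivative)."""
    inner = math.exp(x) + math.exp(-x) - 3.0
    sign = 1.0 if inner >= 0 else -1.0
    return sign * (math.exp(x) - math.exp(-x))


def growth(t):
    """Assumed growth function G(t) = e^t, which dominates |e^x - e^{-x}| for |x| = t."""
    return math.exp(t)


def adaptive_subgradient(x0, num_iters, base_step, seed=0):
    """Stochastic subgradient steps x <- x - g / gamma_k with an adaptive gamma_k.

    gamma_k = sqrt(K) * G(|x_k|) / base_step is a hypothetical rule chosen for
    illustration: it shrinks the step wherever the subgradient may be large.
    """
    rng = random.Random(seed)
    x = x0
    best = f(x)
    for _ in range(num_iters):
        # Unbiased noisy subgradient: multiplicative noise with mean 1 (assumption).
        g = subgrad(x) * (1.0 + 0.1 * rng.gauss(0.0, 1.0))
        gamma_k = math.sqrt(num_iters) * growth(abs(x)) / base_step
        x = x - g / gamma_k
        best = min(best, f(x))
    return x, best


if __name__ == "__main__":
    x_final, best_value = adaptive_subgradient(x0=5.0, num_iters=10_000, base_step=1.0)
    print(f"final iterate: {x_final:.4f}, best objective seen: {best_value:.4e}")
```

Because each step is divided by $G(|x^k|)$, the update length stays bounded even where the subgradient is exponentially large; a fixed stepsize has no such safeguard on this function.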

Paper Structure

This paper contains 53 sections, 23 theorems, 145 equations, 5 figures, 1 table, and 1 algorithm.

Key Result

Lemma 3.1

Suppose that A1 to A5 as well as B1 hold. Then, given $\rho > \kappa + \tau$ and $\gamma_k > \rho$, $\mathbb{E}_k [\psi_{1 / \rho} (x^{k + 1})] \leq \psi_{1 / \rho} (x^k) -\tfrac{\rho (\rho - \tau - \kappa)}{2 (\gamma_k - \kappa)} \| \hat{x}^k - x^k \|^2 + \tfrac{2 \rho L_f^2}{(\gamma_k - \rho) (\gamma_

Figures (5)

  • Figure 1: $f(x) = |e^x + e^{-x} - 3|$ exhibits exponential growth as $\|x\| \rightarrow + \infty$
  • Figure 2: Problem $r_1$. Left two: $(\kappa,p_{\text{fail}})=(10,0.2)$; Right two: $(\kappa,p_{\text{fail}})=(10,0.3)$. x-axis: parameter $\theta$; y-axis: number of iterations. SGD denotes vanilla SGD; SGD-G denotes SGD adaptive to known Lipschitzness; SGD-R denotes SGD adaptive to unknown Lipschitzness. The same applies to SPL.
  • Figure 3: Problem $r_2$. Left two: $(\kappa,p_{\text{fail}})=(1,0.2)$; Right two: $(\kappa,p_{\text{fail}})=(10,0.3)$. x-axis: parameter $\theta$; y-axis: number of iterations.
  • Figure 4: Problem $r_3$. Left two: $(\kappa,p_{\text{fail}})=(1,0.2)$; Right two: $(\kappa,p_{\text{fail}})=(1,0.3)$. x-axis: parameter $\theta$; y-axis: number of iterations.
  • Figure 5: Left two: Problem $r_1$, $(\kappa,p_{\text{fail}})=(1,0.3)$; Right two: Problem $r_2$, $(\kappa,p_{\text{fail}})=(1,0.3)$. x-axis: parameter $\theta$; y-axis: number of iterations.
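
As a short worked check of the growth claim in Figure 1, the following derivation shows why $f(x) = |e^x + e^{-x} - 3|$ has no global Lipschitz constant yet still admits a bound by a growth function. The particular choice $G(t) = e^t$ is an illustrative assumption, not necessarily the bound used in the paper.

$$
\begin{aligned}
f(x) &= e^x + e^{-x} - 3 \quad \text{for } |x| \geq \operatorname{arccosh}(3/2) \approx 0.962, \\
|f'(x)| &= |e^x - e^{-x}| = e^{|x|} - e^{-|x|} \leq e^{|x|} =: G(|x|), \\
\sup_{x} |f'(x)| &= +\infty .
\end{aligned}
$$

The local Lipschitz constant therefore grows exponentially with $|x|$, but it stays controlled by a known function of $|x|$.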

Theorems & Definitions (36)

  • Remark 1
  • Remark 2
  • Lemma 3.1
  • Theorem 3.1
  • Example 4.1: Phase retrieval
  • Example 4.2: Subgradient method
  • Lemma 4.1
  • Theorem 4.1
  • Lemma 4.2: Informal
  • Lemma 4.3
  • ...and 26 more