First-Order Algorithms Without Lipschitz Gradient: A Sequential Local Optimization Approach

Junyu Zhang, Mingyi Hong

TL;DR

A sequential local optimization framework for first-order algorithms that optimizes problems without Lipschitz gradient. It provides the first nonasymptotic convergence rate for a slight variant of the Armijo line search algorithm without assuming a globally Lipschitz continuous gradient or convexity.

Abstract

First-order algorithms have been popular for solving convex and non-convex optimization problems. A key assumption for the majority of these algorithms is that the gradient of the objective function is globally Lipschitz continuous, but many contemporary problems such as tensor decomposition fail to satisfy this assumption. This paper develops a sequential local optimization (SLO) framework of first-order algorithms that can effectively optimize problems without Lipschitz gradient. Operating on the assumption that the gradients are locally Lipschitz continuous over any compact set, the proposed framework carefully restricts the distance between two successive iterates. We show that the proposed framework can easily adapt to existing first-order methods such as gradient descent (GD), normalized gradient descent (NGD), accelerated gradient descent (AGD), as well as GD with Armijo line search. Remarkably, the last of these is totally parameter-free and does not even require knowledge of the local Lipschitz constants. We show that for the proposed algorithms to achieve a gradient error bound of $\|\nabla f(x)\|^2\le \epsilon$, they require at most $\mathcal{O}(\frac{1}{\epsilon}\times \mathcal{L}(Y))$ total accesses to the gradient oracle, where $\mathcal{L}(Y)$ characterizes how the local Lipschitz constants grow with the size of a given set $Y$. Moreover, we show that the variant of AGD improves the dependency on both $\epsilon$ and the growth function $\mathcal{L}(Y)$. The proposed algorithms complement the existing Bregman Proximal Gradient (BPG) algorithm, because they do not require global information about the problem structure to construct and solve Bregman proximal mappings.
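
To make the setting concrete, the following is a minimal sketch, not the paper's algorithm: plain gradient descent with Armijo backtracking, plus a cap on how far one iterate may move. The cap stands in for the idea of restricting the distance between successive iterates. The test objective $f(x) = x^4/4 - x^2/2$, the cap `max_step`, and the constants `c` and `beta` are illustrative assumptions. For this $f$, the gradient $x^3 - x$ has derivative $3x^2 - 1$, so its Lipschitz constant over $[-R, R]$ grows like $R^2$ and no global constant exists.

```python
def armijo_gd_capped(f, grad, x0, max_step=1.0, c=1e-4, beta=0.5,
                     eps=1e-8, max_iter=10_000):
    """Scalar gradient descent with Armijo backtracking and a movement cap.

    Illustrative only: `max_step` is a hypothetical stand-in for limiting the
    distance between successive iterates, not the rule used in the paper.
    Returns the final point and the number of iterations performed.
    """
    x = float(x0)
    for k in range(max_iter):
        g = grad(x)
        if g * g <= eps:  # stopping test: squared gradient at most eps
            return x, k
        # Initial trial step size: unit step, shrunk so that |t * g| <= max_step.
        t = min(1.0, max_step / abs(g))
        # Backtrack until the Armijo sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - c * t * g * g:
            t *= beta
        x -= t * g
    return x, max_iter


def f_quartic(x):
    # Quartic objective; its gradient is locally but not globally Lipschitz.
    return x ** 4 / 4 - x ** 2 / 2


def grad_quartic(x):
    return x ** 3 - x


# Start far from the minimizers at x = -1 and x = 1, where the gradient is large.
x_final, n_iter = armijo_gd_capped(f_quartic, grad_quartic, x0=10.0)
print(x_final, n_iter)
```

No Lipschitz constant is passed to the routine: the backtracking loop finds an acceptable step size from function values alone, which is the property the abstract highlights for the line search variant.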

Paper Structure

This paper contains 25 sections, 19 theorems, 133 equations, 4 figures, 3 tables.

Key Result

Lemma 2.3

Let $\{x^\tau_k\}^{K_\tau}_{k=0}$ be generated by Algorithm alg:Meta as epoch $\tau$, with $\|x^\tau_{\!K_\tau}\!-\!x^\tau_{0}\|\!\in\!\left[D-d,\! D\right]$. If Condition condition:sufficient holds and $D\!\geq\!\sqrt{\frac{C_2^\tau}{4C_1^\tau}} \!+\! 2d$, then $f(x^\tau_{K_\tau}) \!-\! f(x^\tau_0)

Figures (4)

  • Figure 1: Numerical results on symmetric tensor decomposition, over representative initial points.
  • Figure 2: Solving \ref{prob:BPG-sub}
  • Figure 3: Numerical experiments on unsupervised training of linear autoencoder.
  • Figure 4: Numerical experiments on supervised training of linear neural networks.

Theorems & Definitions (32)

  • Example 1.1: Tensor Decomposition
  • Example 1.2: Unsupervised Autoencoder Training
  • Example 1.3: Supervised Neural Network Training
  • Lemma 2.3: Per-epoch descent
  • Theorem 2.4
  • Definition 2.5: Gradient projection subroutine
  • Lemma 2.6
  • Corollary 2.7
  • Theorem 2.8
  • Definition 2.9: Truncated gradient descent subroutine
  • ...and 22 more