A Stochastic Quasi-Newton Method for Non-convex Optimization with Non-uniform Smoothness

Zhenyu Sun; Ermin Wei

TL;DR

This paper proposes a fast stochastic quasi-Newton method for problems with non-uniform smoothness. The method achieves the best-known $\mathcal{O}\left(\epsilon^{-3}\right)$ sample complexity and enjoys a convergence speedup with simple hyperparameter tuning.

Abstract

Classical convergence analyses for optimization algorithms rely on the widely adopted uniform smoothness assumption. However, recent experimental studies have demonstrated that many machine learning problems exhibit non-uniform smoothness, meaning the smoothness factor is a function of the model parameter instead of a universal constant. In particular, it has been observed that the smoothness grows with the gradient norm along the training trajectory. Motivated by this phenomenon, the recently introduced $(L_0, L_1)$-smoothness is a more general notion than traditional $L$-smoothness, one that captures this positive relationship between smoothness and gradient norm. Under this type of non-uniform smoothness, the existing literature has designed stochastic first-order algorithms that use gradient clipping to obtain the optimal $\mathcal{O}(\epsilon^{-3})$ sample complexity for finding an $\epsilon$-approximate first-order stationary solution. Nevertheless, studies of quasi-Newton methods in this setting are still lacking. Motivated by the higher accuracy and greater robustness of quasi-Newton methods, in this paper we propose a fast stochastic quasi-Newton method for problems with non-uniform smoothness. Leveraging gradient clipping and variance reduction, our algorithm achieves the best-known $\mathcal{O}(\epsilon^{-3})$ sample complexity and enjoys a convergence speedup with simple hyperparameter tuning. Our numerical experiments show that the proposed algorithm outperforms state-of-the-art approaches.
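To make the idea of combining gradient clipping with a quasi-Newton direction concrete, here is a minimal Python sketch of one clipped, preconditioned update. It is not the paper's algorithm: the function name `clipped_qn_step`, the parameters `eta` and `gamma`, and the fixed matrix `H` are assumptions for illustration, and variance reduction is omitted. The intuition is that when smoothness grows with the gradient norm, a large gradient signals high curvature, so the step size is shortened.

```python
import numpy as np

def clipped_qn_step(x, grad, H, eta=0.1, gamma=1.0):
    """One clipped, preconditioned step (illustrative sketch only).

    x     : current iterate, shape (d,)
    grad  : (stochastic) gradient estimate at x, shape (d,)
    H     : symmetric positive-definite inverse-Hessian approximation, shape (d, d)
    eta   : base step size (hypothetical name)
    gamma : clipping threshold (hypothetical name)
    """
    grad_norm = np.linalg.norm(grad)
    # Use eta when the gradient is small, gamma / ||grad|| when it is large.
    step = eta if grad_norm == 0.0 else min(eta, gamma / grad_norm)
    return x - step * (H @ grad)

# Toy usage on F(x) = sum(x_i^4) / 4, whose Hessian diag(3 x_i^2) is unbounded,
# so F is not uniformly L-smooth. H is the identity here for simplicity.
x = np.array([3.0, -2.0])
H = np.eye(2)
for _ in range(200):
    x = clipped_qn_step(x, x**3, H, eta=0.1, gamma=0.5)
```

With `H` set to the identity this reduces to clipped gradient descent; a BFGS or L-BFGS approximation would take the place of `H` in a quasi-Newton variant.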

Paper Structure (23 sections, 12 theorems, 70 equations, 3 figures, 1 table, 2 algorithms)

Introduction
Related work
SGD-based methods in non-convex optimization
SQN methods in non-convex optimization
Gradient clipping and non-uniform smoothness
Our contributions
Preliminaries
Optimality condition
$(L_0, L_1)$-smoothness
Examples with $(L_0, L_1)$-smooth properties
A Clipped Stochastic Quasi-Newton Method
Generating $H_k$ with Controllable $(\lambda_m, \lambda_M)$
Stochastic adaptive BFGS method
Stochastic adaptive L-BFGS method
Experiments
...and 8 more sections

Key Result

Proposition 2.3

Consider $F(x) = y \log(\hat{y})$, where $\hat{y} = \sigma(u^T x)$, $\sigma(\cdot)$ is the sigmoid function, and $y, u$ are constant scalars or vectors of suitable dimensions. Then $F(x)$ is $(L_0, \Vert u \Vert)$-smooth for any $L_0 > 0$.
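The proposition can be sanity-checked by hand under the Hessian-based form of $(L_0, L_1)$-smoothness, $\Vert \nabla^2 F(x) \Vert \le L_0 + L_1 \Vert \nabla F(x) \Vert$ (an assumption here, since Definition 2.2 is not reproduced in this summary). Writing $z = u^T x$ and taking $y$ to be a scalar, $\nabla F(x) = y (1 - \sigma(z)) u$ and $\nabla^2 F(x) = -y \sigma(z) (1 - \sigma(z)) u u^T$, so $\Vert \nabla^2 F(x) \Vert = \sigma(z) \Vert u \Vert \Vert \nabla F(x) \Vert \le \Vert u \Vert \Vert \nabla F(x) \Vert$. The Python sketch below checks this inequality numerically at random points; the helper names and the random test data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_F(x, u, y):
    # Gradient of F(x) = y * log(sigmoid(u^T x)).
    return y * (1.0 - sigmoid(u @ x)) * u

def hess_F(x, u, y):
    # Hessian of F(x) = y * log(sigmoid(u^T x)).
    s = sigmoid(u @ x)
    return -y * s * (1.0 - s) * np.outer(u, u)

def max_ratio(num_points=1000, dim=5, seed=0):
    """Largest observed ||Hessian|| / (||u|| * ||gradient||) over random x."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=dim)
    y = 1.0  # scalar label, an assumption for this check
    worst = 0.0
    for _ in range(num_points):
        x = rng.normal(size=dim)
        h = np.linalg.norm(hess_F(x, u, y), 2)  # spectral norm
        g = np.linalg.norm(grad_F(x, u, y))
        worst = max(worst, h / (np.linalg.norm(u) * g))
    return worst

# The ratio equals sigmoid(u^T x), so it should never exceed 1.
worst_ratio = max_ratio()
```

A ratio bounded by 1 means the Hessian norm is at most $\Vert u \Vert \Vert \nabla F(x) \Vert$, which is consistent with the claimed $L_1 = \Vert u \Vert$ for every $L_0 > 0$.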

Figures (3)

Figure 1: Smoothness increases with gradient norm along the training trajectory (figure taken from zhang2019gradient)
Figure 2: Training errors for algorithms solving non-convex robust linear regression problem
Figure 3: Training errors for algorithms solving non-convex logistic regression problem

Theorems & Definitions (21)

Definition 2.1
Definition 2.2
Proposition 2.3
Proposition 2.4
Theorem 3.5
Remark 3.6
Lemma 4.1
Lemma 4.2
Remark 4.4
Theorem 4.5
...and 11 more
