Armijo Line-search Can Make (Stochastic) Gradient Descent Provably Faster

Sharan Vaswani, Reza Babanezhad

TL;DR

For convex objectives corresponding to logistic regression and multi-class classification, GD-LS can converge to the optimum at a linear rate, improving over the sublinear convergence of GD(1/L). For non-convex objectives satisfying gradient domination, GD-LS can match the fast convergence of algorithms tailored to those specific settings.

Abstract

Armijo line-search (Armijo-LS) is a standard method to set the step-size for gradient descent (GD). For smooth functions, Armijo-LS alleviates the need to know the global smoothness constant L and adapts to the "local" smoothness, enabling GD to converge faster. Existing theoretical analyses show that GD with Armijo-LS (GD-LS) can result in constant factor improvements over GD with a 1/L step-size (denoted as GD(1/L)). We strengthen these results and show that if the objective function satisfies a certain non-uniform smoothness condition, GD-LS can result in a faster convergence rate than GD(1/L). In particular, we prove that for convex objectives corresponding to logistic regression and multi-class classification, GD-LS can converge to the optimum at a linear rate, and hence improves over the sublinear convergence of GD(1/L). Furthermore, for non-convex objectives satisfying gradient domination (e.g., those corresponding to the softmax policy gradient in RL or generalized linear models with a logistic link function), GD-LS can match the fast convergence of algorithms tailored for these specific settings. Finally, we analyze the convergence of stochastic GD with a stochastic line-search on convex losses under the interpolation assumption.
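To make the setup concrete, here is a minimal sketch of GD with a backtracking Armijo line-search on unregularized logistic regression. It is an illustration under stated assumptions, not the paper's algorithm as written: the backtracking factor `beta`, the reset of the step-size to `eta_max` at every iteration, the function names, and the toy separable dataset are all choices made for this example. The Armijo condition used is $f(\theta - \eta \nabla f(\theta)) \le f(\theta) - c\,\eta\,\|\nabla f(\theta)\|^2$.

```python
import math


def logistic_loss(theta, X, y):
    """Mean logistic loss (1/n) * sum_i log(1 + exp(-y_i * <x_i, theta>))."""
    total = 0.0
    for xi, yi in zip(X, y):
        m = yi * sum(a * b for a, b in zip(xi, theta))
        # Numerically stable evaluation of log(1 + exp(-m)).
        total += max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
    return total / len(X)


def logistic_grad(theta, X, y):
    """Gradient of the mean logistic loss."""
    g = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        m = yi * sum(a * b for a, b in zip(xi, theta))
        # s = 1 / (1 + exp(m)), computed without overflow.
        if m >= 0:
            e = math.exp(-m)
            s = e / (1.0 + e)
        else:
            s = 1.0 / (1.0 + math.exp(m))
        for j, a in enumerate(xi):
            g[j] -= yi * a * s / len(X)
    return g


def gd_armijo(f, grad, theta0, c=0.5, eta_max=1e8, beta=0.5, iters=30):
    """GD where each step-size is found by backtracking from eta_max
    until the Armijo sufficient-decrease condition holds.
    beta (the backtracking factor) is an assumption of this sketch."""
    theta = list(theta0)
    history = [f(theta)]
    for _ in range(iters):
        g = grad(theta)
        g_sq = sum(v * v for v in g)
        if g_sq == 0.0:
            break  # stationary point reached
        f_cur = history[-1]
        eta = eta_max
        while True:
            cand = [t - eta * v for t, v in zip(theta, g)]
            if f(cand) <= f_cur - c * eta * g_sq:
                break  # Armijo condition satisfied
            eta *= beta  # shrink the step and retry
        theta = cand
        history.append(f(theta))
    return theta, history


# Toy linearly separable data (hypothetical, for illustration only).
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

theta, history = gd_armijo(
    lambda t: logistic_loss(t, X, y),
    lambda t: logistic_grad(t, X, y),
    theta0=[0.0, 0.0],
)
```

Note that the line-search never needs the smoothness constant L: the accepted step-size can grow as the loss shrinks, which is the adaptivity to "local" smoothness described above.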

Paper Structure

This paper contains 19 sections, 43 theorems, 130 equations, 3 figures, 1 algorithm.

Key Result

Proposition 1

Consider $n$ points where $x_i \in \mathbb{R}^d$ are the features and $y_i \in \{-1,1\}$ are the corresponding labels. The logistic regression objective satisfies the non-uniform smoothness assumption (assn:nus) with $L_0 = 0$ and $L_1 = 8\max_{i \in [n]} \left\|x_i \right\|_{2}^{2}$, and satisfies assn:positive-reverse-PL with $\nu = 8\max_{i} \left\|x_i \right\|$ and $\omega = 0$.
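As a small worked example, the constants named in the proposition can be computed directly from a feature matrix. This is only a sketch: the function name and the dictionary keys are hypothetical, and the Euclidean norm is assumed for $\nu$, since the statement above does not specify the norm there.

```python
import math


def prop1_constants(X):
    """Constants from Proposition 1 for features X (a list of rows).
    Assumes the Euclidean norm for both L1 and nu."""
    max_norm = max(math.sqrt(sum(v * v for v in x)) for x in X)
    return {
        "L0": 0.0,
        "L1": 8.0 * max_norm ** 2,
        "nu": 8.0 * max_norm,
        "omega": 0.0,
    }


# Hypothetical features; the largest row norm is 5.
constants = prop1_constants([[3.0, 4.0], [1.0, 0.0]])
```

Both $L_1$ and $\nu$ depend only on the largest feature norm, so rescaling the features rescales the constants accordingly.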

Figures (3)

  • Figure 1: Comparing $\texttt{GD-LS}$ with $c = 1/2$ and $\eta_{\max} = 10^8$ and $\texttt{GD(1/L)}$ for unregularized logistic regression on the ijcnn dataset [chang2011libsvm]. $f^*$ is small and $\texttt{GD-LS}$ converges faster.
  • Figure 2: Comparing $\texttt{GD-LS}$ with $c = 1/2$, $\eta_{\max} = 10^8$ and $\texttt{GD(1/L)}$ for unregularized logistic regression on a synthetic separable dataset with $\gamma = 0.1$, $n = 10^4$ and $d = 200$. (Left) Sub-optimality plot: $\texttt{GD-LS}$ converges linearly, while $\texttt{GD(1/L)}$ has a sublinear convergence. (Right) Step-size plot: The $\texttt{GD-LS}$ step-size increases non-monotonically.
  • Figure 3: Comparing $\texttt{GD-LS}$ with $c = 1/2$, $\eta_{\max} = 10^4$ and $\texttt{GD(1/L)}$ for GLM on a synthetic dataset with $n = 10^4$, $d = 200$, $\left\|\theta^* \right\| = 1$.

Theorems & Definitions (70)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Proposition 4
  • Corollary 3
  • ...and 60 more