Gradient Methods with Online Scaling

Wenzhi Gao, Ya-Chi Chu, Yinyu Ye, Madeleine Udell

TL;DR

A framework that accelerates gradient-based methods with online learning: it learns to scale the gradient at each iteration through an online learning algorithm and provably accelerates gradient-based methods asymptotically.

Abstract

We introduce a framework to accelerate the convergence of gradient-based methods with online learning. The framework learns to scale the gradient at each iteration through an online learning algorithm and provably accelerates gradient-based methods asymptotically. In contrast with previous literature, where convergence is established based on worst-case analysis, our framework provides a strong convergence guarantee with respect to the optimal scaling matrix for the iteration trajectory. For smooth strongly convex optimization, our results provide an $O(\kappa^\star \log(1/\varepsilon))$ complexity result, where $\kappa^\star$ is the condition number achievable by the optimal preconditioner, improving on the previous $O(\sqrt{n}\kappa^\star \log(1/\varepsilon))$ result. In particular, a variant of our method achieves superlinear convergence on convex quadratics. For smooth convex optimization, we show for the first time that the widely-used hypergradient descent heuristic improves on the convergence of gradient descent.
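To make the idea of learning a gradient scaling online concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a diagonal scaling `p`, a toy diagonal quadratic objective, and a hypergradient-style feedback: `p` is updated by online gradient descent on the normalized one-step progress $(f(x - p \odot g) - f(x)) / \|g\|^2$ with $g = \nabla f(x)$. The function names, the step size `eta`, the initialization `p0`, and the monotone safeguard are all illustrative assumptions.

```python
def f(x, a):
    """Toy objective (assumption): f(x) = 0.5 * sum_i a_i * x_i^2."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(x, a):
    """Gradient of the toy quadratic: (a_i * x_i)_i."""
    return [ai * xi for ai, xi in zip(a, x)]


def online_scaled_gd(x0, a, p0, eta, iters):
    """Gradient descent with a diagonal scaling p learned online.

    Hypothetical sketch: p follows online gradient descent on
    h(p) = (f(x - p*g) - f(x)) / ||g||^2, whose gradient in p is
    -(grad f(x - p*g) * g) / ||g||^2, taken elementwise.
    """
    x, p = list(x0), list(p0)
    history = [f(x, a)]
    for _ in range(iters):
        g = grad(x, a)
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:
            break  # stationary point reached
        # Trial step with the current scaling.
        x_trial = [xi - pi * gi for xi, pi, gi in zip(x, p, g)]
        g_trial = grad(x_trial, a)
        # Online gradient step on the scaling: p <- p - eta * grad h(p).
        p = [pi + eta * gt * gi / g_sq for pi, gt, gi in zip(p, g_trial, g)]
        # Monotone safeguard (assumption): accept only if f does not increase.
        if f(x_trial, a) <= history[-1]:
            x = x_trial
        history.append(f(x, a))
    return x, p, history


a = [1.0, 10.0, 100.0]  # curvatures of the toy problem; condition number 100
L = max(a)              # smoothness constant of the toy problem
x_final, p_final, hist = online_scaled_gd(
    x0=[1.0, 1.0, 1.0], a=a, p0=[1.0 / L] * 3, eta=1.0 / L, iters=200
)
print(hist[0], hist[-1], p_final)
```

On this toy problem the ideal diagonal scaling is $1/a_i$ per coordinate, so the learned `p` can be compared against that target. Starting from `p0 = 1/L` makes the first step coincide with plain gradient descent with step size $1/L$.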

Paper Structure

This paper contains 90 sections, 31 theorems, 138 equations, 3 figures, 1 table, and 6 algorithms.

Key Result

Theorem 3.1

Given a non-negative function $\varphi (x) : \mathbb{R}^n \rightarrow \mathbb{R}_+$ and a sequence of iterations $\{ x^k \}$,

Figures (3)

  • Figure 1: Left: comparison of benchmark algorithms on toy quadratic problem. Right: superlinear convergence of OSGM-R on convex quadratics. x-axis: iteration count.
  • Figure 2: Function value gap on least squares problem with $\sigma \in \{ 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1} \}$.
  • Figure 3: Function value gap on the support vector machine problems.

Theorems & Definitions (41)

  • Definition 2.1
  • Theorem 3.1
  • Lemma 4.1: Surrogate loss and measure
  • Proposition 4.1: Properties of $r_x$
  • Lemma 4.2: Learnability
  • Remark 1
  • Theorem 4.1: Trajectory-based convergence
  • Lemma 4.3: Hindsight
  • Corollary 4.1: Global convergence
  • Theorem 4.2: Refined global convergence
  • ...and 31 more