Gradient Methods with Online Scaling Part II. Practical Aspects

Ya-Chi Chu, Wenzhi Gao, Yinyu Ye, Madeleine Udell

TL;DR

This work turns stepsize selection in gradient methods into an online learning problem via the OSGM framework. It introduces OSGM-Best, a variant that combines hypergradient feedback with heavy-ball momentum and lookahead, and that rivals quasi-Newton performance with less memory and cheaper iterations. The paper extends OSGM to smooth nonconvex problems through regularization in stepsize space, establishes theoretical reduction results under broad conditions, and reports extensive numerical experiments on convex problems (e.g., SVM and logistic regression) and nonconvex benchmarks. It also outlines further enhancements via BB steps, proximal settings, and performance-estimation-guided design.

Abstract

Part I of this work [Gao25] establishes online scaled gradient methods (OSGM), a framework that utilizes online convex optimization to adapt stepsizes in gradient methods. This paper focuses on the practical aspects of OSGM. We leverage the OSGM framework to design new adaptive first-order methods and provide insights into their empirical behavior. The resulting method, OSGM-Best, matches the performance of quasi-Newton variants while requiring less memory and cheaper iterations. We also extend OSGM to nonconvex optimization and outline directions that connect OSGM to existing branches of optimization theory and practice.
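
To make the idea of adapting a stepsize by online learning concrete, the following is a minimal sketch rather than the paper's algorithm. It assumes a scalar stepsize, a strongly convex quadratic objective, a normalized hypergradient-style feedback $h(\alpha) = (f(x - \alpha g) - f(x)) / \| g \|^2$ with $g = \nabla f(x)$, and a simple monotone safeguard. The function name `online_stepsize_gd` and the parameters `eta` and `tol` are hypothetical.

```python
import numpy as np

def online_stepsize_gd(A, x0, eta=0.05, tol=1e-20, iters=200):
    """Gradient descent on f(x) = 0.5 * x^T A x with a scalar stepsize learned online.

    Hypothetical sketch: the feedback h(alpha) = (f(x - alpha*g) - f(x)) / ||g||^2
    and the monotone safeguard are assumptions, not the paper's exact method.
    """
    f = lambda z: 0.5 * z @ A @ z
    x = np.asarray(x0, dtype=float)
    alpha = 0.0          # start from a zero stepsize and let the feedback grow it
    alphas = []
    for _ in range(iters):
        g = A @ x
        gnorm2 = g @ g
        if gnorm2 < tol:  # stop before the normalization becomes ill-conditioned
            break
        x_trial = x - alpha * g
        # dh/dalpha = -<grad f(x_trial), g> / ||g||^2
        hypergrad = -((A @ x_trial) @ g) / gnorm2
        # monotone safeguard (assumption): keep the trial point only if f does not increase
        if f(x_trial) <= f(x):
            x = x_trial
        # online gradient descent step on the stepsize itself
        alpha -= eta * hypergrad
        alphas.append(alpha)
    return x, alphas

A = np.diag([1.0, 10.0])
x_final, alphas = online_stepsize_gd(A, [1.0, 0.3])
print(0.5 * x_final @ A @ x_final, alphas[-1])
```

In this sketch the stepsize update is itself a gradient step on $h$, with learning rate `eta`; choosing `eta` no larger than $1/L$ keeps the learned stepsize bounded on a quadratic with Hessian eigenvalues in $[\mu, L]$.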

Paper Structure

This paper contains 74 sections, 25 theorems, 161 equations, 8 figures, 1 table, 8 algorithms.

Key Result

Lemma 2.1

Let $\kappa \geq 2$. If $\alpha_k \leq \tfrac{1}{2}$ and $| x^k_1 | \geq \sqrt{2} \kappa^{3 / 2} | x^k_2 |$, then $\alpha_{k + 1} \geq \alpha_k + \tfrac{\eta}{4}$.

Figures (8)

  • Figure 1: Spiky behavior of Vanilla OSGM-H on a quadratic function.
  • Figure 2: Left: stepsizes $\{ \alpha_k \}$ converge to $\alpha^\star = \tfrac{2}{L + \mu}$ in Classic-HDM and oscillate around $\alpha^\star$ in OSGM. Right: as $\alpha_k$ converges, Classic-HDM converges to $x^\star$ along two directions.
  • Figure 3: Support vector machine problems. First row: function value gap. Second row: gradient norm.
  • Figure 4: Logistic regression problems. First row: function value gap. Second row: gradient norm.
  • Figure 5: CUTEst problems. First row: function value gap. Second row: gradient norm.
  • ...and 3 more figures
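
The value $\alpha^\star = \tfrac{2}{L + \mu}$ in the Figure 2 caption is the classical optimal constant stepsize for gradient descent on a quadratic whose Hessian eigenvalues lie in $[\mu, L]$. The short check below is a sketch under that assumption (the figure's exact setup is not given here); `contraction_factor` is a hypothetical helper and the values of `mu` and `L` are illustrative.

```python
def contraction_factor(alpha, mu, L):
    """Worst-case per-step contraction of gradient descent with stepsize alpha
    on a quadratic whose Hessian eigenvalues lie in [mu, L]."""
    return max(abs(1.0 - alpha * mu), abs(1.0 - alpha * L))

mu, L = 1.0, 10.0                       # illustrative values, not from the paper
alpha_star = 2.0 / (L + mu)
rate_star = contraction_factor(alpha_star, mu, L)

# Compare against a grid of constant stepsizes in (0, 2/L].
grid = [2.0 / L * (i + 1) / 1000 for i in range(1000)]
best_on_grid = min(contraction_factor(a, mu, L) for a in grid)

print(rate_star, (L - mu) / (L + mu), best_on_grid)
```

At $\alpha^\star$ the two extreme eigen-directions contract at the same rate $\tfrac{L - \mu}{L + \mu}$, which is why no other constant stepsize does better in the worst case.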

Theorems & Definitions (37)

  • Lemma 2.1
  • Lemma 2.2: Orbit of the dynamical system
  • Remark 1
  • Theorem 2.1
  • Lemma 2.3: Heavy-ball potential [danilova2020non]
  • Lemma 2.4: Properties of heavy-ball feedback
  • Theorem 2.2: Heavy-ball reduction
  • Theorem 2.3: Global convergence
  • Lemma 3.1: Nonconvexity of $h_x$
  • Remark 2
  • ...and 27 more