Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent

Ya-Chi Chu, Wenzhi Gao, Yinyu Ye, Madeleine Udell

TL;DR

This work provides the first rigorous convergence analysis of hypergradient descent (HDM) within an online learning framework, showing how adaptive preconditioning learns to match local landscape smoothness and accelerate convergence. By formulating hypergradient feedback $h_x(P)$ and proving sublinear and dynamic regret bounds, the authors establish global convergence and local superlinear behavior near optima, with connections to quasi-Newton methods. They address practical instability via null steps and AdaGrad, and extend HDM with heavy-ball and Nesterov momentum, culminating in HDM-Best, a diagonal-preconditioned, AdaGrad-enhanced variant that competes with L-BFGS on convex problems while using less memory. The empirical results on deterministic convex tasks (SVM and logistic regression) demonstrate robust performance and practical viability, suggesting HDM as a competitive alternative for first-order optimization. Future work includes extending the theory to stochastic and nonconvex settings so that HDM can be applied to large-scale machine learning models.

Abstract

This paper investigates the convergence properties of the hypergradient descent method (HDM), a 25-year-old heuristic originally proposed for adaptive stepsize selection in stochastic first-order methods. We provide the first rigorous convergence analysis of HDM using the online learning framework of [Gao24] and apply this analysis to develop new state-of-the-art adaptive gradient methods with empirical and theoretical support. Notably, HDM automatically identifies the optimal stepsize for the local optimization landscape and achieves local superlinear convergence. Our analysis explains the instability of HDM reported in the literature and proposes efficient strategies to address it. We also develop two HDM variants with heavy-ball and Nesterov momentum. Experiments on deterministic convex problems show HDM with heavy-ball momentum (HDM-HB) exhibits robust performance and significantly outperforms other adaptive first-order methods. Moreover, HDM-HB often matches the performance of L-BFGS, an efficient and practical quasi-Newton method, using less memory and cheaper iterations.
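
To make the stepsize adaptation described above concrete, here is a minimal sketch of hypergradient descent with a single scalar stepsize and a null step. It is not the authors' implementation. The feedback is assumed to be the normalized one-step decrease $h_x(\alpha) = (f(x - \alpha \nabla f(x)) - f(x)) / \|\nabla f(x)\|^2$, a form this page names but does not spell out. The function name `hdm_scalar`, the toy quadratic, and the choice `eta = 0.1` are hypothetical and chosen only for illustration.

```python
def hdm_scalar(f, grad, x0, eta, alpha0=0.0, iters=50):
    """Scalar-stepsize hypergradient descent with a null step (illustrative sketch)."""
    x = list(x0)
    alpha = alpha0
    for _ in range(iters):
        g = grad(x)
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 == 0.0:
            break  # exact stationary point: nothing left to do
        # Trial gradient step with the current learned stepsize.
        x_trial = [xi - alpha * gi for xi, gi in zip(x, g)]
        g_trial = grad(x_trial)
        # Derivative of the ASSUMED feedback
        # h(alpha) = (f(x - alpha * g) - f(x)) / ||g||^2 with respect to alpha.
        hyper = -sum(a * b for a, b in zip(g_trial, g)) / gnorm2
        # Null step: move only if the trial point does not increase f.
        if f(x_trial) <= f(x):
            x = x_trial
        # Online gradient descent on the stepsize, applied even after a null step.
        alpha -= eta * hyper
    return x, alpha


# Hypothetical toy problem: f(x) = 0.5 * (1 * x1^2 + 10 * x2^2).
CURV = (1.0, 10.0)


def f(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURV, x))


def grad(x):
    return [c * xi for c, xi in zip(CURV, x)]


x_final, alpha_final = hdm_scalar(f, grad, [1.0, 1.0], eta=0.1)
print(f(x_final), alpha_final)
```

On this toy problem the stepsize first settles near 1/10, the reciprocal of the largest curvature, and then drifts toward 1 once the iterate lies along the low-curvature axis. This is the kind of local adaptation the abstract refers to, shown here only for a scalar stepsize rather than a full or diagonal preconditioner.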

Paper Structure

This paper contains 63 sections, 20 theorems, 116 equations, 9 figures, 1 table, 4 algorithms.

Key Result

Lemma 2.1

For any $x \not\in \mathcal{X}^{\star}$.

Figures (9)

  • Figure 1: The behavior of different HDM variants on the toy quadratic optimization problem. First panel: two-phase convergence behavior of vanilla HDM. Second panel: effect of null step and our best variant HDM-Best.
  • Figure 2: Addressing instability of HDM
  • Figure 3: The convergence behavior of HDM-HB and HDM-AGD on a toy quadratic problem. First panel: HDM-HB. Second panel: HDM with Nesterov momentum.
  • Figure 4: Support vector machine problems. First row: function value gap. Second row: gradient norm.
  • Figure 5: Logistic regression problems. First row: function value gap. Second row: gradient norm.
  • ...and 4 more figures

Theorems & Definitions (33)

  • Lemma 2.1: Extension of Proposition 6.1 in [Gao24]
  • Lemma 2.2: Sublinear regret [Gao24]
  • Lemma 2.3: Logarithmic regret
  • Remark 1
  • Lemma 2.4: Sharper version of Lemma 6.1 in [Gao24]
  • Theorem 3.1: Static adaptivity
  • Theorem 3.2: Dynamic adaptivity
  • Theorem 3.3: Local superlinear convergence
  • Lemma 3.1
  • Theorem 3.4: Convergence of the preconditioner
  • ...and 23 more