Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent

Ya-Chi Chu; Wenzhi Gao; Yinyu Ye; Madeleine Udell

TL;DR

This work provides the first rigorous convergence analysis of the hypergradient descent method (HDM) within an online learning framework, showing how an adaptively learned preconditioner matches the smoothness of the local landscape and accelerates convergence. By formulating the hypergradient feedback $h_x(P)$ and proving sublinear and dynamic regret bounds, the authors establish global convergence and local superlinear convergence near optima, with connections to quasi-Newton methods. They address practical instability via null steps and AdaGrad, and extend HDM with heavy-ball and Nesterov momentum, culminating in HDM-Best, a diagonal-preconditioned, AdaGrad-enhanced variant that competes with L-BFGS on convex problems while using less memory. Empirical results on deterministic convex tasks (SVM and logistic regression) demonstrate robust performance, suggesting HDM as a competitive first-order optimization method. Future work includes extending the theory to stochastic and nonconvex settings so that it applies to large-scale machine learning models.

Abstract

This paper investigates the convergence properties of the hypergradient descent method (HDM), a 25-year-old heuristic originally proposed for adaptive stepsize selection in stochastic first-order methods. We provide the first rigorous convergence analysis of HDM using the online learning framework of [Gao24] and apply this analysis to develop new state-of-the-art adaptive gradient methods with empirical and theoretical support. Notably, HDM automatically identifies the optimal stepsize for the local optimization landscape and achieves local superlinear convergence. Our analysis explains the instability of HDM reported in the literature and proposes efficient strategies to address it. We also develop two HDM variants with heavy-ball and Nesterov momentum. Experiments on deterministic convex problems show HDM with heavy-ball momentum (HDM-HB) exhibits robust performance and significantly outperforms other adaptive first-order methods. Moreover, HDM-HB often matches the performance of L-BFGS, an efficient and practical quasi-Newton method, using less memory and cheaper iterations.
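
To make the stepsize-adaptation idea concrete, the following is a minimal sketch, not the authors' implementation. It makes several assumptions not stated above: the preconditioner is reduced to a single scalar stepsize `alpha`; the feedback is taken to be $h_x(\alpha) = [f(x - \alpha \nabla f(x)) - f(x)] / \|\nabla f(x)\|^2$; the stepsize is updated by plain online gradient descent with a fixed rate `eta`; and a "null step" is interpreted as keeping the current iterate whenever the trial point does not decrease $f$. The names `hdm_scalar`, `quad_f`, and `quad_grad` are hypothetical.

```python
def hdm_scalar(f, grad, x0, alpha0=0.0, eta=0.1, iters=200):
    """Sketch of hypergradient descent with one scalar stepsize (assumed form)."""
    x, alpha = list(x0), alpha0
    fx = f(x)
    for _ in range(iters):
        g = grad(x)
        g2 = sum(gi * gi for gi in g)  # squared gradient norm
        if g2 < 1e-20:                 # gradient is numerically zero: stop
            break
        # Trial gradient step with the current learned stepsize.
        x_trial = [xi - alpha * gi for xi, gi in zip(x, g)]
        g_trial = grad(x_trial)
        # Derivative of the assumed feedback h_x(alpha) with respect to alpha.
        hyper = -sum(a * b for a, b in zip(g_trial, g)) / g2
        # Online gradient descent update on the stepsize.
        alpha -= eta * hyper
        # Null step (assumed interpretation): move only if f does not increase.
        f_trial = f(x_trial)
        if f_trial <= fx:
            x, fx = x_trial, f_trial
    return x, alpha


# Illustrative ill-conditioned quadratic (hypothetical test problem).
def quad_f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def quad_grad(x):
    return [x[0], 10.0 * x[1]]


x_final, alpha_final = hdm_scalar(quad_f, quad_grad, [1.0, 1.0])
```

On a quadratic, the derivative of this feedback equals $-1 + \alpha\rho$, where $\rho$ is the Rayleigh quotient of the Hessian along the current gradient, so the update pulls `alpha` toward the locally optimal stepsize $1/\rho$. The choice `eta=0.1` is tuned to this toy problem and is not a recommendation from the paper.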
