Glocal Hypergradient Estimation with Koopman Operator

Ryuichiro Hataya; Yoshinobu Kawahara

TL;DR

This paper tackles the inefficiency of gradient-based hyperparameter optimization by proposing glocal hypergradient estimation, which leverages the Koopman operator to infer global hypergradients from a trajectory of local hypergradients. By approximating the global gradient with a finite-dimensional linear model derived from local dynamics, the method enables greedy hyperparameter updates that combine the reliability of global optimization with the speed of local updates. Theoretical analysis provides complexity bounds and an error guarantee relative to the true global gradient, while experiments on optimizer hyperparameters and data reweighting show performance close to global methods but with substantial efficiency gains. The approach offers a scalable framework for bi-level optimization in deep learning, with potential extensions to stochastic and more complex meta-learning settings.

Abstract

Gradient-based hyperparameter optimization methods update hyperparameters using hypergradients, that is, gradients of a meta criterion with respect to hyperparameters. Previous research has used two distinct update strategies: optimizing hyperparameters with global hypergradients obtained after completing model training, or with local hypergradients derived after every few model updates. While global hypergradients offer reliability, their computational cost is significant; conversely, local hypergradients provide speed but are often suboptimal. In this paper, we propose *glocal* hypergradient estimation, blending "global" quality with "local" efficiency. To this end, we use Koopman operator theory to linearize the dynamics of hypergradients so that the global hypergradients can be efficiently approximated using only a trajectory of local hypergradients. Consequently, we can optimize hyperparameters greedily using estimated global hypergradients, achieving both reliability and efficiency. Through numerical experiments on hyperparameter optimization, including optimization of optimizers, we demonstrate the effectiveness of glocal hypergradient estimation.
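The idea of approximating a global hypergradient from a short trajectory of local hypergradients can be illustrated with a minimal sketch. It assumes plain (exact) dynamic mode decomposition rather than the Hankel DMD variant mentioned in the figure captions, and it runs on a synthetic trajectory rather than real hypergradients. The names `dmd_extrapolate` and `toy_local_trajectory` are hypothetical and do not come from the paper.

```python
import numpy as np


def dmd_extrapolate(traj, horizon, rank=None):
    """Fit a linear model h_{t+1} ~= A h_t to `traj` and predict h_horizon.

    traj: array of shape (m + 1, d), one snapshot per row (assumed layout).
    horizon: step index to predict, counted from traj[0].
    """
    X = traj[:-1].T  # snapshots h_0 .. h_{m-1}, shape (d, m)
    Y = traj[1:].T   # snapshots h_1 .. h_m,     shape (d, m)

    # Truncated SVD of X; drop numerically negligible directions.
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    if rank is None:
        rank = int(np.sum(s > 1e-10 * s[0]))
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]

    # Reduced operator A_tilde = U^* Y V Sigma^{-1}.
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s
    lam, W = np.linalg.eig(A_tilde)  # eigenvalues lambda_j
    modes = U @ W                    # projected DMD modes u_j

    # Amplitudes b_j such that h_0 ~= sum_j b_j u_j.
    b = np.linalg.lstsq(modes, X[:, 0], rcond=None)[0]

    # h_t ~= sum_j b_j lambda_j^t u_j, evaluated at t = horizon.
    return (modes @ (b * lam**horizon)).real


def toy_local_trajectory(num_steps):
    """Synthetic stand-in for local hypergradients: a fixed point plus two decaying modes."""
    c = np.array([1.0, -2.0, 0.5])    # limit of the trajectory
    v1 = np.array([0.3, 0.1, -0.4])   # decays like 0.8^t
    v2 = np.array([-0.2, 0.5, 0.1])   # decays like 0.5^t
    t = np.arange(num_steps)[:, None]
    return c + 0.8**t * v1 + 0.5**t * v2


# Estimate the far-future value from only the first nine snapshots.
short_traj = toy_local_trajectory(9)
estimate = dmd_extrapolate(short_traj, horizon=200)
```

In this toy setting the trajectory is exactly linear, so the decaying modes vanish at a large horizon and only the component with eigenvalue close to $1$ remains. Real hypergradient trajectories are noisy and only approximately linear, so the estimate would carry an approximation error there.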

Paper Structure (16 sections, 14 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Background
Gradient-based Bi-level Optimization
Computation of Hypergradients
Koopman Operator Theory
Glocal Hypergradient Estimation
Experiments
Optimizing Optimizer Hyperparameters
Data Reweighting
Analysis
Discussion and Conclusion
Additional Discussion on \ref{thm:error}
Detailed Experimental Settings
Optimizing Optimizer Hyperparameters
Data Reweighting
...and 1 more sections

Figures (8)

Figure 1: A schematic view of hypergradients. We want to update hyperparameters with the global hypergradient, but computing it requires waiting for the entire training process to complete. Updating hyperparameters ${\bm \phi}_{s}$ with a local hypergradient is efficient but may lead to suboptimal solutions. To leverage both advantages, we propose glocal hypergradient estimation, which approximates the global hypergradient using a local hypergradient trajectory for $t\in[(s-1)\tau+1,\dots,s\tau]$, so that ${\bm \phi}_{s}$ can be updated efficiently using ${\bm{h}}_T$.
Figure 2: Test accuracy and the transition of the hyperparameters of SGD and Adam. The proposed glocal approach shows hyperparameter development similar to that of the global baseline.
Figure 3: The transition of the learning rate hyperparameter and test accuracy curves of WideResNet 28-2 on CIFAR-10.
Figure 4: Left: The eigenvalues obtained by the Hankel DMD for a hypergradient trajectory. Eigenvalues close to $1$ are highlighted in orange. Middle: The magnitudes of the modes $b_j\lambda_j^t u_j$ (for $j$ with $\lambda_j\neq 1$) of the estimated hypergradient for the learning rate, over 100 iterations of $t$. Right: A comparison of the validation performance of the proposed estimation under different configurations. LeNet is trained on FMNIST with an initial learning rate of $0.1$.
Figure C.1: The transition of SGD's hyperparameters and test accuracy curves of LeNet on MNIST with an initial learning rate of 0.01.
...and 3 more figures
