A Hessian-informed hyperparameter optimization for differential learning rate
Shiyun Xu, Zhiqi Bu, Yiliang Zhang, Ian Barnett
TL;DR
The paper addresses hyperparameter optimization for differential learning rates (DLR) by introducing Hessian-informed DLR (Hi-DLR), which uses curvature information to compute per-group learning rates adaptively. Building on a second-order, next-loss minimization framework, Hi-DLR derives an optimal learning-rate vector $\bm{\eta}_{[K]}^{*}$ for $K$ parameter groups from the quantities $\mathbf{A}_*$ and $\mathbf{b}_*$. A diagonal approximation reduces the cost to $O(K)$ per learning-rate update, and the amortized cost approaches $O(1)$ when the rates are updated infrequently. A backpropagation-free procedure estimates $\mathbf{A}_*$ and $\mathbf{b}_*$ efficiently, making Hi-DLR compatible with any optimizer and any parameter grouping, including PEFT methods such as LoRA. Empirical results on natural language understanding, image classification, multi-task learning, and NAM regression show that Hi-DLR improves convergence and can outperform both a uniform learning rate and prior Hessian-based learning-rate methods, with the largest gains when the parameter groups have distinct loss curvature. The work thus offers a scalable HPO mechanism for differential learning rates that applies across models and optimizers.
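To make the roles of $\mathbf{A}_*$ and $\mathbf{b}_*$ concrete, here is a minimal sketch on a toy quadratic loss. It is not the authors' implementation. It assumes the second-order model $L(\mathbf{w} - \sum_k \eta_k \mathbf{d}_k) \approx L(\mathbf{w}) - \mathbf{b}^\top \bm{\eta} + \frac{1}{2} \bm{\eta}^\top \mathbf{A} \bm{\eta}$, with $b_k = \mathbf{G}^\top \mathbf{d}_k$ and $A_{jk} = \mathbf{d}_j^\top \mathbf{H} \mathbf{d}_k$, where $\mathbf{G}$ is the gradient, $\mathbf{H}$ the Hessian, and $\mathbf{d}_k$ the update direction restricted to group $k$. Under that model the minimizer is $\bm{\eta}^* = \mathbf{A}^{-1}\mathbf{b}$, and the diagonal approximation is $\eta_k^* \approx b_k / A_{kk}$. The probing scheme, the function name `fit_A_b`, and the toy problem are all illustrative assumptions.

```python
import numpy as np

def fit_A_b(loss_fn, w, dirs, probe=0.1):
    """Fit L(w - sum_k eta_k * dirs[k]) - L(w) ~ -b.eta + 0.5 * eta^T A eta
    from loss evaluations only (forward passes, no backpropagation)."""
    K = len(dirs)
    pairs = [(j, k) for j in range(K) for k in range(j + 1, K)]
    # Probe points: +/- probe along each axis, then one joint probe per pair.
    etas = []
    for k in range(K):
        for s in (probe, -probe):
            e = np.zeros(K)
            e[k] = s
            etas.append(e)
    for j, k in pairs:
        e = np.zeros(K)
        e[j] = e[k] = probe
        etas.append(e)
    etas = np.array(etas)
    D = np.stack(dirs)  # shape (K, n_params)
    base = loss_fn(w)
    y = np.array([loss_fn(w - e @ D) - base for e in etas])
    # Least-squares unknowns: b_k, then A_kk, then A_jk for j < k.
    cols = [-etas[:, k] for k in range(K)]
    cols += [0.5 * etas[:, k] ** 2 for k in range(K)]
    cols += [etas[:, j] * etas[:, k] for j, k in pairs]
    coef, *_ = np.linalg.lstsq(np.stack(cols, axis=1), y, rcond=None)
    b = coef[:K]
    A = np.diag(coef[K:2 * K])
    for (j, k), c in zip(pairs, coef[2 * K:]):
        A[j, k] = A[k, j] = c
    return A, b

# Toy problem (assumed): two groups with very different curvature.
H = np.diag([10.0, 10.0, 0.1, 0.1]) + 0.01 * np.ones((4, 4))
loss = lambda w: 0.5 * w @ H @ w
w = np.ones(4)
G = H @ w  # gradient of the toy loss; also used as the SGD update direction
groups = [np.array([0, 1]), np.array([2, 3])]
dirs = []
for idx in groups:
    d = np.zeros_like(G)
    d[idx] = G[idx]  # direction restricted to one group, zero elsewhere
    dirs.append(d)

A_fit, b_fit = fit_A_b(loss, w, dirs)
eta_full = np.linalg.solve(A_fit, b_fit)  # eta* = A^{-1} b
eta_diag = b_fit / np.diag(A_fit)         # diagonal approximation, O(K)

# Closed-form references, available here only because H is known.
D = np.stack(dirs)
A_exact = D @ H @ D.T
b_exact = D @ G
w_next = w - eta_full @ D
```

On an exactly quadratic loss the fit recovers $\mathbf{A}$ and $\mathbf{b}$ up to floating-point error. On a real network the quadratic model is only a local approximation, so the estimates would be noisier. In this toy problem the first group has the larger curvature and receives the smaller learning rate.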
Abstract
Differential learning rate (DLR), a technique that applies different learning rates to different model parameters, has been widely used in deep learning and has achieved empirical success in its various forms. For example, parameter-efficient fine-tuning (PEFT) applies zero learning rates to most parameters, which significantly reduces the computational cost. At its core, DLR leverages the observation that different parameters can have different loss curvature, which is hard to characterize in general. We propose the Hessian-informed differential learning rate (Hi-DLR), an efficient approach that solves the hyperparameter optimization (HPO) of learning rates and adaptively captures the loss curvature for any model and optimizer. Given a proper grouping of parameters, we empirically demonstrate that Hi-DLR can improve convergence by dynamically determining the learning rates during training.
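As an illustration of the DLR idea itself, the sketch below applies one plain SGD step in which each parameter uses the learning rate of its group, and a PEFT-style setup appears as a group whose rate is zero. The parameter names, group labels, and the helper `dlr_sgd_step` are hypothetical and chosen only for this example.

```python
import numpy as np

def dlr_sgd_step(params, grads, group_of, lr_by_group):
    """One SGD step where each parameter uses its group's learning rate."""
    return {
        name: p - lr_by_group[group_of[name]] * grads[name]
        for name, p in params.items()
    }

# Hypothetical two-group model: a frozen backbone and a trainable adapter.
params = {"backbone.weight": np.ones(3), "adapter.weight": np.ones(2)}
grads = {"backbone.weight": np.full(3, 0.5), "adapter.weight": np.full(2, 0.5)}
group_of = {"backbone.weight": "frozen", "adapter.weight": "trainable"}
lr_by_group = {"frozen": 0.0, "trainable": 0.1}

new_params = dlr_sgd_step(params, grads, group_of, lr_by_group)
```

The sketch only shows the equivalence at the level of the update rule. Practical PEFT implementations typically also skip computing gradients for the frozen group, which is where the computational savings come from.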
