
A Hessian-informed hyperparameter optimization for differential learning rate

Shiyun Xu, Zhiqi Bu, Yiliang Zhang, Ian Barnett

TL;DR

The paper addresses the hyperparameter optimization challenge of differential learning rates (DLR) by introducing Hessian-informed DLR (Hi-DLR), which uses curvature information to compute per-group learning rates adaptively. Building on a second-order, next-loss minimization framework, Hi-DLR derives an optimal learning rate vector $\bm{\eta}_{[K]}^{*}$ for $K$ parameter groups from a matrix $\mathbf{A}_*$ and a vector $\mathbf{b}_*$, and employs a diagonal approximation to achieve $O(K)$ complexity per update; when the learning rates are updated only infrequently, the amortized cost approaches $O(1)$. A backpropagation-free estimation procedure computes $\mathbf{A}_*$ and $\mathbf{b}_*$ efficiently, making Hi-DLR compatible with any optimizer and parameter grouping, including PEFT methods like LoRA. Empirical results across natural language understanding, image classification, multi-task learning, and NAM regression demonstrate that Hi-DLR improves convergence and can outperform both a uniform learning rate and prior Hessian-based learning rate methods, with the best gains arising when parameter groups reflect distinct loss curvature. This work broadens the practical applicability of curvature-aware optimizers by providing a scalable, broadly applicable HPO mechanism for differential learning rates.
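
To make the mechanism concrete, here is a minimal NumPy sketch of the diagonal, backpropagation-free idea summarized above. It is an illustration under stated assumptions, not the authors' implementation: the function name `fit_group_lr`, the probe step sizes `xis`, the least-squares fit, and the zero fallback for non-positive curvature are all hypothetical choices. For each parameter group $k$, it probes the loss along that group's update direction using forward evaluations only, fits a quadratic $\tfrac{1}{2}A_k\xi^2 - b_k\xi$ to the loss change, and takes the minimizer $b_k / A_k$ as the group's learning rate.

```python
import numpy as np

def fit_group_lr(loss_fn, w, g, groups, xis=(-0.02, 0.02, 0.04)):
    """Per-group learning rates from a diagonal quadratic fit (illustrative sketch).

    loss_fn: callable mapping a parameter vector to a scalar loss
    w:       current parameters (1-D array)
    g:       update direction from any optimizer (here: the gradient)
    groups:  list of index arrays, one per parameter group
    xis:     hypothetical probe step sizes (an assumption, not from the paper)
    """
    xis = np.asarray(xis, dtype=float)
    base = loss_fn(w)
    # Design matrix for delta(xi) ~ 0.5 * A_k * xi^2 - b_k * xi (no intercept).
    X = np.stack([0.5 * xis**2, -xis], axis=1)
    etas = np.zeros(len(groups))
    for k, idx in enumerate(groups):
        deltas = np.empty_like(xis)
        for j, xi in enumerate(xis):
            # Move only group k; this needs forward passes, not backprop.
            w_probe = w.copy()
            w_probe[idx] -= xi * g[idx]
            deltas[j] = loss_fn(w_probe) - base
        (A_k, b_k), *_ = np.linalg.lstsq(X, deltas, rcond=None)
        # Assumed fallback: freeze the group if the fitted curvature is not positive.
        etas[k] = b_k / A_k if A_k > 0 else 0.0
    return etas

# Toy quadratic loss with two groups of very different curvature.
h = np.array([1.0, 1.0, 100.0, 100.0])
loss = lambda w: 0.5 * np.sum(h * w**2)
w = np.array([1.0, -2.0, 0.5, 0.25])
g = h * w  # gradient of the toy loss
groups = [np.array([0, 1]), np.array([2, 3])]

etas = fit_group_lr(loss, w, g, groups)
w_new = w.copy()
for eta, idx in zip(etas, groups):
    w_new[idx] -= eta * g[idx]
print(etas, loss(w_new))
```

On this toy problem the fitted rates come out close to the inverse curvature of each group (roughly 1 and 0.01), which is the behavior a single uniform learning rate cannot reproduce. The sketch ignores cross-group interactions, which is what the diagonal approximation amounts to here.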

Abstract

Differential learning rate (DLR), a technique that applies different learning rates to different model parameters, has been widely used in deep learning and achieved empirical success via its various forms. For example, parameter-efficient fine-tuning (PEFT) applies zero learning rates to most parameters so as to significantly save the computational cost. At the core, DLR leverages the observation that different parameters can have different loss curvature, which is hard to characterize in general. We propose the Hessian-informed differential learning rate (Hi-DLR), an efficient approach that solves the hyperparameter optimization (HPO) of learning rates and captures the loss curvature for any model and optimizer adaptively. Given a proper grouping of parameters, we empirically demonstrate that Hi-DLR can improve the convergence by dynamically determining the learning rates during the training.

Paper Structure (34 sections, 15 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Hi-DLR outperforms manual ULR and DLR on multiple tasks optimized with LoRA (see experiment details in \ref{app:lora}). Left: synthetic regression. Middle & right: text classification on the CoLA dataset in terms of accuracy and learning rate ratio $\eta_B/\eta_A$.
  • Figure 2: Optimizing over 2D test functions. The left two plots are the results of optimizing an ellipse function; the right two plots show the optimization on a function that is the sum of Beale and Rosenbrock. Hi-DLR is our method; Hi-ULR recovers GeN; the rest use a manually selected learning rate. See experiment details in \ref{app:syntheticDLR}.
  • Figure 3: Second-order Taylor approximation in equation \ref{eq:lr parabola diag} is sufficiently accurate. We visualize losses with two-group Hi-DLR (bias) under the settings in \ref{sec:cv}. Left & middle: $L(\bm{w}_{(1)}-\xi_j\mathbf{g}_{(1)})$ and $L(\bm{w}_{(2)}-\xi_j\mathbf{g}_{(2)})$ in dots at iteration 200. Solid lines are the fitted quadratic functions, with minimizer marked by dashed vertical lines. Right: the loss truth is the left side of equation \ref{eq:lr parabola diag} plus $L(\bm{w})$, and the loss prediction is the right side of equation \ref{eq:lr parabola diag} plus $L(\bm{w})$.
  • Figure 4: Fine-tuning results on CelebA. From left to right, the first panel shows the average train loss over 40 labels; the second panel shows their average test accuracy; the third and fourth panels show the test losses of two individual labels. See the results of all 40 tasks in \ref{app:celeba}.
  • Figure 5: Loss and learning rate of NAM on two regression tasks. The first row is the synthetic dataset; the second row is the California Housing dataset. From left to right, the first two plots show the training and test losses, where the grey lines are results trained with a list of manually picked learning rates, the blue curves correspond to Hi-ULR, and the red curves correspond to Hi-DLR; the last plot shows the learning rates for different groups.
  • ...and 8 more figures