Studying K-FAC Heuristics by Viewing Adam through a Second-Order Lens

Ross M. Clarke, José Miguel Hernández-Lobato

TL;DR

This work investigates how much K-FAC-style heuristics, as opposed to the curvature model itself, contribute to second-order optimisation, by embedding them into Adam to create AdamQLR. Evaluating across regression and classification tasks with ASHA-based hyperparameter search, the authors observe that K-FAC's adaptive heuristics are of variable general effectiveness, while an untuned AdamQLR often matches tuned baselines in performance per unit of runtime. The study highlights the potential of combining first-order update directions with second-order stabilisation, but also shows that second-order heuristics may not generalise across tasks or scales. Overall, the results motivate further work on when such heuristics help and how to unify the strengths of first- and second-order methods.

Abstract

Research into optimisation for deep learning is characterised by a tension between the computational efficiency of first-order, gradient-based methods (such as SGD and Adam) and the theoretical efficiency of second-order, curvature-based methods (such as quasi-Newton methods and K-FAC). Noting that second-order methods often only function effectively with the addition of stabilising heuristics (such as Levenberg-Marquardt damping), we ask how much these (as opposed to the second-order curvature model) contribute to second-order algorithms' performance. We thus study AdamQLR: an optimiser combining damping and learning rate selection techniques from K-FAC (Martens & Grosse, 2015) with the update directions proposed by Adam, inspired by considering Adam through a second-order lens. We evaluate AdamQLR on a range of regression and classification tasks at various scales and hyperparameter tuning methodologies, concluding K-FAC's adaptive heuristics are of variable standalone general effectiveness, and finding an untuned AdamQLR setting can achieve comparable performance vs runtime to tuned benchmarks.
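The abstract names two ingredients borrowed from K-FAC: learning rate selection and Levenberg-Marquardt damping, applied to Adam's update direction. The following is a minimal sketch of that combination on a toy quadratic, not the authors' implementation. It assumes the learning rate is chosen by minimising a damped quadratic model along the Adam direction and that the damping is adjusted from a reduction ratio with thresholds 1/4 and 3/4. The function names, the exact toy curvature matrix `A`, the adjustment `factor` and the initial damping are all hypothetical choices made for illustration.

```python
import numpy as np

def adam_direction(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the unscaled Adam update direction and the updated moment state."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)  # bias-corrected first moment
    v_hat = v / (1 - beta2**t)  # bias-corrected second moment
    return -m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

def quadratic_lr(grad, direction, curvature, damping):
    """Step size minimising the damped quadratic model along `direction`.

    Model: M(alpha) = f + alpha * g^T d + 0.5 * alpha^2 * d^T (C + damping * I) d
    """
    damped = direction @ (curvature @ direction) + damping * (direction @ direction)
    return -(grad @ direction) / damped

def update_damping(damping, rho, factor=0.95):
    """Levenberg-Marquardt style rule (factor is a hypothetical value)."""
    if rho > 0.75:   # model predicted the loss change well: relax damping
        return damping * factor
    if rho < 0.25:   # model was over-optimistic: increase damping
        return damping / factor
    return damping

def run_demo(steps=20, damping=1e-2):
    A = np.diag([1.0, 10.0])                 # toy curvature matrix (assumption)
    loss = lambda th: 0.5 * th @ A @ th      # toy quadratic loss
    theta = np.array([1.0, 1.0])
    state = (np.zeros(2), np.zeros(2), 0)
    losses = [loss(theta)]
    for _ in range(steps):
        grad = A @ theta
        d, state = adam_direction(grad, state)
        alpha = quadratic_lr(grad, d, A, damping)
        damped = d @ (A @ d) + damping * (d @ d)
        predicted = alpha * (grad @ d) + 0.5 * alpha**2 * damped  # model change
        new_theta = theta + alpha * d
        actual = loss(new_theta) - loss(theta)                    # true change
        if predicted < 0:
            damping = update_damping(damping, actual / predicted)
        theta = new_theta
        losses.append(loss(theta))
    return losses

if __name__ == "__main__":
    print(run_demo()[-1])
```

In this toy setting the curvature matrix is the exact Hessian, so the model never over-predicts the loss reduction and the damping only shrinks; with a stochastic or approximate curvature estimate the reduction ratio can fall below 1/4, which is the case the damping rule exists to handle.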

Paper Structure (50 sections, 4 equations, 22 figures, 7 tables, 2 algorithms)

Figures (22)

  • Figure 1: Optimisation trajectories over 200 steps from a fixed initial point on the Rosenbrock Function. Hyperparameter tuning used 200 standard-normal random initial points.
  • Figure 2: Median training (left) and test (right) performance trajectories, bootstrap-sampled over 50 repetitions per algorithm. Hyperparameters chosen by ASHA over 200 initialisations. Note changes of scale on the time axes. See also results on loss metrics and learning rate evolutions in Figures \ref{fig:AlgorithmLosses} and \ref{fig:AlgorithmLearningRates}, and numerical comparison in Table \ref{tab:EpochConstrainedFinalResults}.
  • Figure 8: Sensitivity studies for AdamQLR on Fashion-MNIST over (a) learning rate rescaling, (b) batch size and (c) initial damping, showing test losses.
  • Figure 9: Median learning rate trajectories, bootstrap-sampled over 50 repetitions per algorithm. Hyperparameters chosen by ASHA over 200 initialisations. Note changes of scale on the time axes. See also our numerical presentation in Table \ref{tab:EpochConstrainedFinalResults}.
  • Figure 15: Median training (left) and test (right) loss trajectories, bootstrap-sampled over 50 repetitions per algorithm. Hyperparameters chosen by ASHA over 200 initialisations. Note changes of scale on the time axes. See also our numerical comparison in Table \ref{tab:EpochConstrainedFinalResults}.
  • ...and 17 more figures