Training neural networks faster with minimal tuning using pre-computed lists of hyperparameters for NAdamW

Sourabh Medapati, Priya Kasimbeg, Shankar Krishnan, Naman Agarwal, George Dahl

TL;DR

The paper tackles hyperparameter tuning under tight budgets by introducing precomputed ordered hyperparameter lists for the NAdamW optimizer, validated on the AlgoPerf benchmark. It proposes a broad, workload-agnostic search space and a greedy cost-minimization procedure to produce a 5-point list that generalizes to unseen base workloads and robustness-variant tasks. The main contributions include the 5-point (and larger) hyperparameter lists, a formal cost function for evaluating lists, and empirical evidence that the approach outperforms simple sweeps and off-the-shelf Bayesian optimization under the same budget. The results suggest a practical, turn-key tuning method that reduces tuning effort while maintaining performance across diverse architectures and datasets, with potential extensions to other optimizers and scaling scenarios.
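To make the greedy cost-minimization idea concrete, here is a minimal sketch of building an ordered list from precomputed search results. Everything in it is an assumption made for illustration: the cost (mean over workloads of the best time-to-target among the selected points, with a fixed penalty when no selected point reaches the target), the names `list_cost`, `greedy_list`, and `PENALTY`, and the toy numbers. It is not the paper's cost function $\mathcal{C}$ or its data.

```python
from typing import Dict, List, Optional

# Assumed cost charged to a workload that no selected point trains successfully.
PENALTY = 2.0

# times[point][w] is the (normalized) time for hyperparameter point `point`
# to reach the target on workload w, or None if it never reached it.
Times = Dict[str, List[Optional[float]]]


def list_cost(selected: List[str], times: Times, n_workloads: int) -> float:
    """Hypothetical list cost: mean over workloads of the best time achieved
    by any selected point, or PENALTY if none of them reached the target."""
    total = 0.0
    for w in range(n_workloads):
        reached = [times[p][w] for p in selected if times[p][w] is not None]
        total += min(reached) if reached else PENALTY
    return total / n_workloads


def greedy_list(times: Times, size: int) -> List[str]:
    """Build an ordered list by repeatedly appending the candidate point
    that lowers the cost of the list-so-far the most."""
    n_workloads = len(next(iter(times.values())))
    chosen: List[str] = []
    remaining = sorted(times)  # sorted for deterministic tie-breaking
    for _ in range(min(size, len(remaining))):
        best = min(
            remaining,
            key=lambda p: list_cost(chosen + [p], times, n_workloads),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy data (made up): three candidate points evaluated on three workloads.
toy_times: Times = {
    "A": [0.5, 0.6, None],
    "B": [0.9, None, 0.4],
    "C": [0.7, 0.7, 0.9],
}

print(greedy_list(toy_times, 2))
```

In this toy example, point `C` is picked first because it is the only one that reaches the target on every workload, and `B` is picked second because it complements `C` on the third workload. The order matters: a user with a budget of k trials would run only the first k entries.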

Abstract

If we want to train a neural network using any of the most popular optimization algorithms, we are immediately faced with a dilemma: how to set the various optimization and regularization hyperparameters? When computational resources are abundant, there are a variety of methods for finding good hyperparameter settings, but when resources are limited the only realistic choices are using standard default values of uncertain quality and provenance, or tuning only a couple of the most important hyperparameters via extremely limited hand-designed sweeps. Extending the idea of default settings to a modest tuning budget, Metz et al. (2020) proposed using ordered lists of well-performing hyperparameter settings, derived from a broad hyperparameter search on a large library of training workloads. However, to date, no practical and performant hyperparameter lists that generalize to representative deep learning workloads have been demonstrated. In this paper, we present hyperparameter lists for NAdamW derived from extensive experiments on the realistic workloads in the AlgoPerf: Training Algorithms benchmark. Our hyperparameter lists also include values for basic regularization techniques (i.e., weight decay, label smoothing, and dropout). In particular, our best NAdamW hyperparameter list performs well on AlgoPerf held-out workloads not used to construct it, and represents a compelling turn-key approach to tuning when restricted to five or fewer trials. It also outperforms basic learning rate/weight decay sweeps and an off-the-shelf Bayesian optimization tool when restricted to the same budget.
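As a rough illustration of the turn-key usage described above, the sketch below runs the first few entries of an ordered list and keeps the best one. The helper `tune_with_list`, the stand-in objective `toy_validation_error`, and every value in `PLACEHOLDER_LIST` are hypothetical placeholders, not the paper's published settings. In practice the objective would be a full training run that returns a validation metric.

```python
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, float]

# Placeholder values only: NOT the hyperparameter list from the paper.
PLACEHOLDER_LIST: List[Config] = [
    {"learning_rate": 0.01, "weight_decay": 0.1,
     "label_smoothing": 0.0, "dropout": 0.0},
    {"learning_rate": 0.001, "weight_decay": 0.01,
     "label_smoothing": 0.1, "dropout": 0.1},
    {"learning_rate": 0.003, "weight_decay": 0.05,
     "label_smoothing": 0.2, "dropout": 0.0},
]


def tune_with_list(
    hparam_list: List[Config],
    train_and_eval: Callable[[Config], float],
    budget: int = 5,
) -> Tuple[Optional[Config], float]:
    """Run the first `budget` entries in order; return the config with the
    lowest score (lower is better) together with that score."""
    best_cfg: Optional[Config] = None
    best_score = float("inf")
    for cfg in hparam_list[:budget]:
        score = train_and_eval(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_validation_error(cfg: Config) -> float:
    """Stand-in for a real training run: pretends the ideal learning rate
    is 0.0025 and returns the distance to it."""
    return abs(cfg["learning_rate"] - 0.0025)


best, score = tune_with_list(PLACEHOLDER_LIST, toy_validation_error, budget=3)
print(best, score)
```

Because the list is ordered, shrinking `budget` simply truncates it, so the same list serves any budget up to its length.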

Paper Structure

This paper contains 28 sections, 2 equations, 5 figures, 14 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of different tuning budgets vs. the proposed 5-point hyperparameter list for the ImageNet ResNet and Criteo 1TB DLRM workloads. The tuning efficiency gains vary from workload to workload, with ImageNet ResNet showing the highest gains: there, the 5-point hyperparameter list is as efficient as broad search with 3x the tuning budget. For our Criteo 1TB DLRM and WMT workloads, the 5-point hyperparameter list is as efficient as random search with the same tuning budget. See the appendix comparing broad search with the hyperparameter list for results on other workloads.
  • Figure 2: left: number of left-out workloads trained successfully vs hyperparameter list size, right: mean $\mathcal{C}$ over left-out workloads vs hyperparameter list size.
  • Figure 3: Comparison of different tuning budgets vs. the proposed 5-point hyperparameter list for the LibriSpeech Conformer and LibriSpeech DeepSpeech workloads.
  • Figure 4: Comparison of different tuning budgets vs. the proposed 5-point hyperparameter list for the WMT Transformer and Criteo 1TB DLRMsmall workloads.
  • Figure 5: Comparison of different tuning budgets vs. the proposed 5-point hyperparameter list for the ImageNet ViT and OGBG GNN workloads.