Scrap Your Schedules with PopDescent

Abhinav Pomalapally; Bassel El Mabsout; Renato Mancuso

TL;DR

This work tackles the inefficiency and rigidity of fixed hyper-parameter schedules by introducing PopDescent, a progress-aware memetic algorithm for hyper-parameter search that uses training progress directly, measured through a cross-validation fitness proxy. It blends an $m$-elitist evolutionary strategy with gradient-based local search, mutating hyper-parameters and weights with Gaussian noise while selecting the next generation based on normalized fitness. Across the FMNIST, CIFAR-10, and CIFAR-100 vision benchmarks, PopDescent converges faster and reaches up to 18% lower test loss than baselines including grid search, Bayesian and random search, schedules, and ESGD. The method is robust to initialization, which reduces the need to tune its own hyper-parameters, and it comes with a simple, openly shared TensorFlow 2 reference implementation.

Abstract

In contemporary machine learning workloads, numerous hyper-parameter search algorithms are frequently utilized to efficiently discover high-performing hyper-parameter values, such as learning and regularization rates. As a result, a range of parameter schedules have been designed to leverage the capability of adjusting hyper-parameters during training to enhance loss performance. These schedules, however, introduce new hyper-parameters to be searched and do not account for the current loss values of the models being trained. To address these issues, we propose Population Descent (PopDescent), a progress-aware hyper-parameter tuning technique that employs a memetic, population-based search. By merging evolutionary and local search processes, PopDescent proactively explores hyper-parameter options during training based on their performance. Our trials on standard machine learning vision tasks show that PopDescent converges faster than existing search methods, finding model parameters with test-loss values up to 18% lower, even when considering the use of schedules. Moreover, we highlight the robustness of PopDescent to its initial training parameters, a crucial characteristic for hyper-parameter search techniques.
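To make the mechanics named in the TL;DR concrete, the following is a minimal sketch of one possible reading of the loop: gradient-based local search, a fitness proxy computed on held-out data, $m$-elitist selection, and Gaussian mutation of both weights and hyper-parameters. It is not the authors' TensorFlow 2 implementation. The toy 1-D regression task, the fitness mapping `1 / (1 + loss)`, the mutation strength `1 - fitness`, the learning-rate clipping range, and every function name here are assumptions chosen for illustration.

```python
import math
import random


def make_data(rng, n, true_w=3.0, noise=0.1):
    """Toy 1-D regression data (hypothetical task): y = true_w * x + noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0.0, noise)) for x in xs]


def mse(w, data):
    """Mean squared error of the scalar model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grad(w, data):
    """Derivative of mse with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def pop_descent_sketch(train, val, pop_size=6, m=2, iterations=100, seed=0):
    rng = random.Random(seed)
    # Each individual carries a weight and its own learning rate.
    pop = [{"w": rng.gauss(0.0, 1.0), "lr": 10 ** rng.uniform(-3, 0)}
           for _ in range(pop_size)]
    for _ in range(iterations):
        # Local search: one gradient step per individual, using its own lr.
        for ind in pop:
            ind["w"] -= ind["lr"] * grad(ind["w"], train)
        # Fitness proxy from held-out loss, mapped into (0, 1] (assumed form).
        fitness = [1.0 / (1.0 + mse(ind["w"], val)) for ind in pop]
        total = sum(fitness)
        probs = [f / total for f in fitness]  # normalized fitness
        # m-elitism: the m fittest individuals survive unchanged.
        order = sorted(range(pop_size), key=lambda i: fitness[i], reverse=True)
        elites = [dict(pop[i]) for i in order[:m]]
        # The remaining slots are filled by sampling parents in proportion
        # to normalized fitness and mutating them with Gaussian noise.
        parents = rng.choices(range(pop_size), weights=probs, k=pop_size - m)
        children = []
        for i in parents:
            strength = 1.0 - fitness[i]  # weaker parents are perturbed more
            new_lr = pop[i]["lr"] * math.exp(rng.gauss(0.0, strength))
            children.append({
                "w": pop[i]["w"] + rng.gauss(0.0, strength),
                # Clipping keeps this toy problem stable (an assumption).
                "lr": min(1.0, max(1e-4, new_lr)),
            })
        pop = elites + children
    # Report the individual with the lowest held-out loss.
    return min(pop, key=lambda ind: mse(ind["w"], val))


data_rng = random.Random(42)
train, val = make_data(data_rng, 64), make_data(data_rng, 32)
best = pop_descent_sketch(train, val)
print(best["w"], best["lr"], mse(best["w"], val))
```

Because the learning rate travels with each individual and is mutated alongside the weights, no separate schedule has to be specified: individuals whose learning rate suits the current stage of training tend to score better on the held-out proxy and are more likely to be kept or sampled as parents.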

Paper Structure (23 sections, 1 equation, 2 figures, 5 tables, 2 algorithms)

Introduction
Population Descent
Algorithm Definition
Key points in PopDescent's design
Limitations
Evaluations
Benchmarks
Competing Algorithms
Discussion
Convergence
Ablation Study
Hyper-parameter Sensitivity
Related works
Conclusion
Reproducibility Statement
...and 8 more sections

Figures (2)

Figure 1: Benchmark Tests. The lower plots show the best test loss achieved by each method, with standard deviation across 10 seeded trials; the top plots show how many gradient steps each method takes to converge. Each column of plots represents one vision benchmark, and all methods are compared with and without a regularization term on each benchmark. PopDescent (red bar with circles) achieves the lowest test loss on every problem. Quantitative data are provided in the Appendix, Table \ref{table:Bench}.
Figure 2: Convergence. Validation loss of each hyper-parameter optimizer against the number of gradient steps taken on the FMNIST dataset. Each curve is the exponential moving average across six random seeds, plotted with standard deviation. All methods tune only the learning rate (no regularization). Note that Sklearn Hyperopt, KT RandomSearch, and KT Scheduling remain flat until 46k gradient steps: their tuning process evaluates each hyper-parameter combination for two epochs at a time, and real training occurs only after this search (post 46k steps). See Table \ref{table:Convergence} for quantitative results.
