Critical Hyper-Parameters: No Random, No Cry
Olivier Bousquet, Sylvain Gelly, Karol Kurach, Olivier Teytaud, Damien Vincent
TL;DR
Hyperparameter optimization in deep learning is expensive and often parallelized via one-shot strategies. The paper argues that low-discrepancy sequences, especially Scrambled-Hammersley with random shift (S-SH), offer superior dispersion properties and robustness to irrelevant parameters, outperforming grid search, random search, and Latin Hypercube Sampling (LHS) in many settings. Theoretical results bound the optimization error in terms of dispersion and show that these sequences project favorably onto the critical variables, while extensive experiments across language modeling and image classification confirm practical speed-ups and improved hyperparameter choices. The approach also works well as an initialization for Bayesian optimization, suggesting a simple, drop-in replacement for standard hyperparameter search in deep learning pipelines.
Abstract
The selection of hyper-parameters is critical in Deep Learning. Because of the long training time of complex models and the availability of compute resources in the cloud, "one-shot" optimization schemes - where the sets of hyper-parameters are selected in advance (e.g. on a grid or in a random manner) and the training is executed in parallel - are commonly used. Grid search is known to be sub-optimal, especially when only a few critical parameters matter, and random search has been suggested as an alternative. Yet, random search can be "unlucky" and produce sets of values that leave some part of the domain unexplored. Quasi-random methods, such as Low Discrepancy Sequences (LDS), avoid these issues. We show that such methods have theoretical properties that make them appealing for performing hyperparameter search, and demonstrate that, when applied to the selection of hyperparameters of complex Deep Learning models (such as state-of-the-art LSTM language models and image classification models), they yield suitable hyperparameter values with far fewer runs than random search. We propose a particularly simple LDS method which can be used as a drop-in replacement for grid or random search in any Deep Learning pipeline, either as a fully one-shot hyperparameter search or as an initializer in iterative batch optimization.
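To make the idea of a drop-in low-discrepancy sampler concrete, the following is a minimal Python sketch of a Hammersley point set with digit scrambling and a random shift. It is an illustration under stated assumptions, not the authors' implementation: the function names (`first_primes`, `scrambled_radical_inverse`, `shifted_scrambled_hammersley`, `to_log_range`) are hypothetical, the scrambling uses one random permutation of the non-zero digits per base, and the shift is a single uniform vector added modulo 1. The paper's exact scrambling scheme may differ.

```python
import random


def first_primes(k):
    """Return the first k primes by trial division (fine for small k)."""
    primes = []
    candidate = 2
    while len(primes) < k:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def scrambled_radical_inverse(index, base, perm):
    """Mirror the base-`base` digits of `index` around the radix point,
    mapping each digit through `perm` first.
    Assumption: perm[0] == 0, so the implicit leading zeros contribute nothing."""
    result = 0.0
    scale = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += perm[digit] * scale
        scale /= base
    return result


def shifted_scrambled_hammersley(n, dim, seed=0):
    """Return n points in [0, 1)^dim.

    Coordinate 0 is i / n; coordinate j >= 1 is the scrambled radical
    inverse of i in the j-th prime base. A single random shift is then
    added to every point, modulo 1."""
    rng = random.Random(seed)
    bases = first_primes(dim - 1)
    perms = []
    for base in bases:
        nonzero = list(range(1, base))
        rng.shuffle(nonzero)          # random permutation of the non-zero digits
        perms.append([0] + nonzero)   # digit 0 stays fixed (assumption of this sketch)
    shift = [rng.random() for _ in range(dim)]

    points = []
    for i in range(n):
        coords = [i / n]
        coords += [scrambled_radical_inverse(i, b, p) for b, p in zip(bases, perms)]
        points.append([(c + s) % 1.0 for c, s in zip(coords, shift)])
    return points


def to_log_range(u, low, high):
    """Map u in [0, 1) to [low, high) on a log scale (e.g. a learning rate)."""
    return low * (high / low) ** u
```

Each point in the unit cube can then be mapped to actual hyperparameter values, for example `to_log_range(point[0], 1e-4, 1e-1)` for a learning rate, and all `n` configurations can be trained in parallel, exactly as one would with a grid or with random search.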
