Critical Hyper-Parameters: No Random, No Cry

Olivier Bousquet, Sylvain Gelly, Karol Kurach, Olivier Teytaud, Damien Vincent

TL;DR

Hyperparameter optimization in deep learning is expensive and often parallelized via one-shot strategies. The paper argues that low-discrepancy sequences, especially Scrambled-Hammersley with random shift (S-SH), offer superior dispersion properties and robustness to irrelevant parameters, outperforming grid, random, and LHS in many settings. Theoretical results bound the optimization error in terms of dispersion and show that projections onto the critical variables remain well spread, while extensive experiments across language modeling and image classification confirm practical speed-ups and improved hyperparameter choices. The approach also serves as a strong initialization for Bayesian optimization, suggesting a simple, drop-in replacement for standard HP search in DL pipelines.

Abstract

The selection of hyper-parameters is critical in Deep Learning. Because of the long training time of complex models and the availability of compute resources in the cloud, "one-shot" optimization schemes - where the sets of hyper-parameters are selected in advance (e.g. on a grid or in a random manner) and the training is executed in parallel - are commonly used. It has been shown that grid search is sub-optimal, especially when only a few critical parameters matter, and random search has been suggested instead. Yet, random search can be "unlucky" and produce sets of values that leave some part of the domain unexplored. Quasi-random methods, such as Low Discrepancy Sequences (LDS), avoid these issues. We show that such methods have theoretical properties that make them appealing for performing hyperparameter search, and demonstrate that, when applied to the selection of hyperparameters of complex Deep Learning models (such as state-of-the-art LSTM language models and image classification models), they yield suitable hyperparameter values with far fewer runs than random search. We propose a particularly simple LDS method which can be used as a drop-in replacement for grid or random search in any Deep Learning pipeline, either as a fully one-shot hyperparameter search or as an initializer in iterative batch optimization.
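To make the "drop-in replacement" idea concrete, here is a minimal sketch of a scrambled Hammersley sampler with a random shift. It is not the authors' implementation: the function names, the choice of scrambling (one random digit permutation per prime base, with digit 0 kept fixed), and the centered first coordinate `(k + 0.5) / n` are all assumptions made for illustration.

```python
import random


def first_primes(count):
    """Return the first `count` primes by trial division (fine for small counts)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def scrambled_radical_inverse(k, base, perm):
    """Mirror the base-`base` digits of k around the radix point,
    mapping each digit through the permutation `perm` first."""
    value, scale = 0.0, 1.0 / base
    while k > 0:
        k, digit = divmod(k, base)
        value += perm[digit] * scale
        scale /= base
    return value


def scrambled_hammersley(n, dim, seed=0):
    """Return n points in [0, 1)^dim (hypothetical helper, assumptions in the lead-in)."""
    rng = random.Random(seed)
    primes = first_primes(dim - 1)
    perms = []
    for p in primes:
        nonzero = list(range(1, p))
        rng.shuffle(nonzero)
        perms.append([0] + nonzero)  # assumption: digit 0 stays fixed
    shift = [rng.random() for _ in range(dim)]  # one random shift per coordinate
    points = []
    for k in range(n):
        raw = [(k + 0.5) / n]  # Hammersley: first coordinate is an even grid
        raw += [scrambled_radical_inverse(k + 1, p, perm)
                for p, perm in zip(primes, perms)]
        points.append([(x + s) % 1.0 for x, s in zip(raw, shift)])
    return points


def log_uniform(u, low, high):
    """Map u in [0, 1) to a log-uniform value in [low, high)."""
    return low * (high / low) ** u


# Usage: 16 one-shot configurations over 3 hyperparameters.
configs = scrambled_hammersley(16, 3, seed=1)
learning_rates = [log_uniform(c[0], 1e-4, 1e-2) for c in configs]
```

Because all points are fixed in advance, the resulting configurations can be trained in parallel exactly as with grid or random search; only the sampling step changes.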

Paper Structure

This paper contains 32 sections, 6 theorems, 1 equation, 10 figures, 4 tables.

Key Result

Lemma 1

Let $\omega(f,x^*,\delta)$ be the modulus of continuity of $f$ around $x^*$, $\omega(f,x^*,\delta)= \sup_{y:\|x^*-y\|\le \delta}|f(x^*)-f(y)|$. Then for any fixed sequence $S$, $|\min_{x\in S}f(x) - f(x^*)| \le \omega(f, x^*, \mathrm{disp}(S))$, and for any distribution $P$ over sequences, with probability

Figures (10)

  • Figure 1: Left: Summary of the properties for some of the considered sampling methods. S-SH and S-Ha have all the desired properties. Right: Pathological examples where various sampling algorithms will perform worse than Random. LHS can produce sequences that are aligned with the diagonal of the domain or that are completely off-diagonal with a higher probability than Random. This can be exploited by a function with high values on the diagonal and low values everywhere else. Grid: when the area with low values is thin and depends on one axis only, Grid is more likely to fail than Random. LDS (with or without random shift): as explained in Section \ref{sec:path}, if the function values are high around the points in the sequence and low otherwise, even with random shift, the performance will be worse than Random. Halton or Hammersley without scrambling: due to the sequential nature of the function $\gamma_q(k)$, the left part of the domain (lower values) is sampled more frequently than the right-hand side.
  • Figure 2: Experiments on real-world language modeling, depending on the budget: we provide frequencies at which S-SH (resp. LHS) outperforms random and speedup interpretations. Left: Histogram of budgets used for comparing LHS, S-SH and Random. Right: Winning rate of S-SH and LHS compared to random, on language modeling tasks (PTB, UBPTB, MiniWiki) with various budgets.
  • Figure 3: Comparison between the average loss (bits per byte) for S-SH and for random on PTB-words (left) and PTB-words (right). Moving average (5 successive values) of the performance for setup C in Section \ref{boringdetails} (3 HPs).
  • Figure 4: Cifar10: test loss for random and S-SH as a function of the budget.
  • Figure 5: Language modeling with moderate networks, (x) in Table \ref{table:main}.
  • ...and 5 more figures

Theorems & Definitions (12)

  • Definition 1: Volume Dispersion
  • Definition 2: Dispersion
  • Definition 3: Stochastic Dispersion
  • Lemma 1
  • Definition 4: Discrepancy
  • Theorem 1: Rote and Tichy
  • Lemma 2: Relations between measures
  • Theorem 2: Asymptotic Rate
  • Remark 1
  • Theorem 3: Guaranteed Success
  • ...and 2 more
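
Lemma 1 bounds the one-shot optimization error by the dispersion of the sampled points, so it can help to see how dispersion might be measured in practice. The sketch below assumes the standard definition, $\mathrm{disp}(S) = \sup_{x\in[0,1]^d}\min_{s\in S}\|x-s\|$ with the Euclidean norm (the excerpt above does not reproduce the paper's exact definition), and estimates it with random probes. The function name and parameters are hypothetical, and the estimate is only a lower bound on the true supremum.

```python
import math
import random


def estimate_dispersion(points, n_probes=20000, seed=0):
    """Monte Carlo lower bound on the dispersion of `points` in [0, 1]^d.

    For each random probe, find the distance to its nearest sample point;
    the largest such distance found approximates the supremum from below.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    worst = 0.0
    for _ in range(n_probes):
        probe = [rng.random() for _ in range(dim)]
        nearest = min(math.dist(probe, p) for p in points)
        worst = max(worst, nearest)
    return worst


# Usage: a centered 2x2 grid in the unit square.
grid = [[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]]
grid_dispersion = estimate_dispersion(grid)
```

For this grid the true value is $0.25\sqrt{2}$, attained at the center and at the corners of the square, so the estimate approaches it from below. If $f$ is $L$-Lipschitz around $x^*$, then $\omega(f,x^*,\delta)\le L\delta$ and the first bound in Lemma 1 reads $|\min_{x\in S}f(x)-f(x^*)|\le L\,\mathrm{disp}(S)$: a point set with smaller dispersion gives a tighter guarantee.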