Hyperparameter Optimization in Machine Learning

Luca Franceschi, Michele Donini, Valerio Perrone, Aaron Klein, Cédric Archambeau, Matthias Seeger, Massimiliano Pontil, Paolo Frasconi

TL;DR

This survey presents hyperparameter optimization (HPO) as a structured, repeatable process that is essential to the performance of modern ML systems. It organizes HPO approaches into five main families: elementary grid, random, and quasi-random search; model-based Bayesian optimization; multi-fidelity strategies; population-based algorithms; and gradient-based methods that rely on hypergradients. For each family it discusses trade-offs, parallelizability, and practical considerations. It also covers extended topics such as multi-objective and constrained HPO, neural architecture search, meta-learning, and transfer across model scales, and it reviews HPO systems and benchmarking ecosystems. The survey closes with open questions and research directions, emphasizing reproducibility, efficiency, and applicability to large-scale foundation models and unsupervised settings.

Abstract

Hyperparameters are configuration variables controlling the behavior of machine learning algorithms. They are ubiquitous in machine learning and artificial intelligence and the choice of their values determines the effectiveness of systems based on these technologies. Manual hyperparameter search is often time-consuming and becomes infeasible when the number of hyperparameters is large. Automating the search is an important step towards advancing, streamlining, and systematizing machine learning, freeing researchers and practitioners alike from the burden of finding a good set of hyperparameters by trial and error. In this survey, we present a unified treatment of hyperparameter optimization, providing the reader with examples, insights into the state-of-the-art, and numerous links to further reading. We cover the main families of techniques to automate hyperparameter search, often referred to as hyperparameter optimization or tuning, including random and quasi-random search, bandit-, model-, population-, and gradient-based approaches. We further discuss extensions, including online, constrained, and multi-objective formulations, touch upon connections with other fields, such as meta-learning and neural architecture search, and conclude with open questions and future research directions.

Paper Structure

This paper contains 104 sections, 94 equations, 14 figures, 1 table, 3 algorithms.

Figures (14)

  • Figure 1: Results on sentiment analysis reported in yogotama2015-emnlp15. Hyperparameters and their domains (both continuous and discrete) are listed on the left. Here $[n_{min},n_{max}]$ denotes the $n$-gram range. SVM: support vector machine; LR: logistic regression; NN: multi-layer perceptron; CNN: convolutional neural network.
  • Figure 2: Geometrical representations of the $L^2$ (left) and $L^1$ (right) regularizers. $w^*_0$ represents the minimizer of the training loss $\mathcal{J}_{\lambda}$ when the regularization coefficient $\lambda^{(6)}$ is $0$ (i.e. no regularization). $w^*_{\rho_1}$ and $w^*_{\rho_2}$ are, instead, the minimizers of the regularized losses for two different values of $\lambda^{(6)}=\rho_i$ for $i\in \{1, 2\}$, with $0 < \rho_1<\rho_2$. These correspond to the projections of $w^*_0$ onto $L^2$ and $L^1$ balls with varying radii that are inversely proportional to $\lambda^{(6)}$.
  • Figure 3: Three examples of response functions. From left to right, the underlying algorithms are: ridge regression, two-layer neural network regression, and Monte Carlo tree search chen_learning_2017.
  • Figure 4: Non-adaptive sampling methods on the unit square for different sample sizes: 4 (left), 9 (center), and 16 (right) points. Each row represents a different scheme. Grid and Latin hypercube sampling (LHS) require re-sampling every time we increase the sample size. For random and Sobol' sampling, instead, we can keep adding points, "reusing" previous samples. For each plot, we also show projections of the samples onto the two axes. Grid samples cover far fewer distinct points on the projections than the other sampling methods. We can also observe the "unlucky run" phenomenon of (uniform) random sampling, where after nine draws the upper-right quadrant is still unvisited. LHS and Sobol' (with random shifts and scrambling) provide more even coverage of the space and of its unidimensional projections.
  • Figure 5: Left: Suppose we have measured the response function (plotted in red) on a small set of points. Fitting a Gaussian process on these points yields a mean function (plotted in black) and a variance function ($\pm$ one standard deviation is filled in gray). If $\lambda$ falls between 7 and 11, the uncertainty is very high, which yields the right bump in the expected improvement (plotted in green), allowing us to explore a region where a smaller response might be found. The left bump of the expected improvement falls in an exploitation region, which in this example loses to exploration. Hence, the next candidate is $\lambda= 8.5$. Right: Once the response is measured at the new candidate, the Gaussian process is trained again. Now the uncertainty shrinks significantly and the new expected improvement has just one bump around 2.7, which is already very close to the minimum of the true response. This example takes inspiration from Figure 11 in Jones1998:Efficient.
  • ...and 9 more figures
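
The projection argument in the Figure 4 caption can be checked numerically. The following is a minimal sketch, not code from the survey: it uses only the Python standard library, and all function names (`grid_sample`, `random_sample`, `latin_hypercube_sample`, `covered_strata`) are hypothetical helpers introduced for illustration. It draws 16 points on the unit square with grid, uniform random, and Latin hypercube sampling, then counts how many of the 16 equal-width strata on the first axis contain at least one point. Sobol' sampling is left out because it would need an external library.

```python
import itertools
import random


def grid_sample(k):
    """Return the k * k cell centres of a regular grid on the unit square."""
    ticks = [(i + 0.5) / k for i in range(k)]
    return list(itertools.product(ticks, ticks))


def random_sample(n, rng):
    """Return n independent uniform draws on the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def latin_hypercube_sample(n, rng):
    """Return n points with exactly one point per 1/n-wide stratum on each axis."""
    columns = []
    for _ in range(2):
        strata = list(range(n))
        rng.shuffle(strata)  # random pairing of strata across the two axes
        columns.append([(s + rng.random()) / n for s in strata])
    return list(zip(*columns))


def covered_strata(points, axis, n):
    """Count how many of the n equal-width strata on `axis` contain a point."""
    return len({min(int(p[axis] * n), n - 1) for p in points})


rng = random.Random(0)  # fixed seed so the run is repeatable (assumed value)
n = 16
samples = {
    "grid": grid_sample(4),  # 4 x 4 grid, 16 points
    "random": random_sample(n, rng),
    "lhs": latin_hypercube_sample(n, rng),
}
coverage = {name: covered_strata(pts, 0, n) for name, pts in samples.items()}

for name, hit in coverage.items():
    print(f"{name:>6}: {hit} of {n} strata covered on the first axis")
```

By construction, the 4 x 4 grid projects onto only 4 distinct values per axis, whereas LHS places one point in each of the 16 strata. Uniform random sampling typically lands in between and varies with the seed, which is the "unlucky run" effect the caption describes.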