Information-theoretic Bayesian Optimization: Survey and Tutorial

Eduardo C. Garrido-Merchán

TL;DR

The paper surveys information-theoretic acquisition functions for Bayesian optimization, addressing the problem of efficiently optimizing expensive, noisy black-box objectives $f:\mathcal{X}\to\mathbb{R}$ with unknown gradients. It introduces core information-theoretic quantities such as entropy $H(\cdot)$ and mutual information $I(\cdot;\cdot)$ and shows how acquisitions like $I(\mathbf{x};\mathbf{x}^*)$ or $H(\mathbf{x}^*|\mathcal{D})$ guide the next evaluation. The survey traces the evolution from IAGO and Entropy Search to Predictive Entropy Search, Max-value Entropy Search, and joint/alpha-divergence variants, detailing adaptations to constrained, multi-objective, multi-fidelity, and batch settings, along with practical approximations (e.g., expectation propagation) to keep computations tractable. It also discusses practical implementations, performance trade-offs, and future research directions for robust, scalable information-theoretic Bayesian optimization.

Abstract

Several scenarios require the optimization of non-convex black-box functions, that is, noisy, expensive-to-evaluate functions with no known analytical expression, whose gradients are hence not accessible. One example is the hyper-parameter tuning of machine learning models. Bayesian optimization is a class of methods with state-of-the-art performance on this problem in real scenarios. It is an iterative process that employs a probabilistic surrogate model of the objective function, typically a Gaussian process, to compute a posterior predictive distribution of the black-box function. Based on the information given by this posterior predictive distribution, Bayesian optimization computes an acquisition function that represents, for every point of the input space, the utility of evaluating that point in the next iteration when the goal of the process is to retrieve a global extremum. This paper is a survey of information-theoretic acquisition functions, which typically outperform other acquisition functions. The main concepts of information theory are also described in detail, so that the reader understands why information-theoretic acquisition functions deliver strong results in Bayesian optimization and how they can be approximated when they are intractable. We also cover how information-theoretic acquisition functions can be adapted to complex optimization scenarios such as the multi-objective, constrained, non-myopic, multi-fidelity, parallel and asynchronous settings, and we outline further lines of research.
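The iterative process described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the paper's algorithm: the toy `objective`, the squared-exponential kernel with fixed hyper-parameters, the grid search over the acquisition, and every function name are hypothetical choices, and Expected Improvement is used as a simple stand-in for the information-theoretic acquisitions that the survey covers.

```python
import math
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior predictive mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_ts.T @ alpha
    v = np.linalg.solve(chol, k_ts)
    var = np.diag(rbf_kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected Improvement for a minimization problem."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def objective(x):
    """Hypothetical stand-in for an expensive black-box function."""
    return np.sin(3.0 * x) + 0.5 * x


def bayes_opt(n_init=3, n_iter=10, seed=0):
    """Run the loop: fit surrogate, maximize acquisition, evaluate, repeat."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 2.0, 201)           # candidate inputs
    x_obs = rng.uniform(0.0, 2.0, size=n_init)  # initial design
    y_obs = objective(x_obs)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)
        acq = expected_improvement(mean, std, y_obs.min())
        x_next = grid[np.argmax(acq)]           # next suggestion
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = np.argmin(y_obs)
    return x_obs[best], y_obs[best]


x_best, y_best = bayes_opt()
print(f"best input {x_best:.3f}, best value {y_best:.3f}")
```

Swapping `expected_improvement` for another acquisition function changes only one line of the loop, which is why the choice of acquisition can be studied independently of the surrogate model.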

Paper Structure

This paper contains 11 sections, 27 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Gaussian process posterior distribution of an objective function and the associated acquisition function (Expected Improvement; Garnett, 2023), whose maximum is the next point suggested by the Bayesian optimization procedure. A better prediction incurs a higher acquisition value, which is higher still when the associated uncertainty is larger. However, Expected Improvement does not take the uncertainty on the left into account, and the optimum may lie there.
  • Figure 2: Shannon information content (blue) and probabilities (yellow) associated with the 5 events of a discrete random variable. The lower the probability of an event, the higher its surprise, as measured by the Shannon information content, and vice versa.
  • Figure 3: Entropy of two discrete random variables. The entropy is the expected surprise of the events of a random variable, with probabilities given by its probability mass function. The random variable on the left has higher entropy than the one on the right. This means that we are more certain about the values of the random variable on the right, since its lowest value is more likely to occur, whereas on the left we do not know which outcome will come next. Consequently, we need fewer bits to encode the values of the variable on the right. Information-theoretic Bayesian optimization approaches rely on this, as discussed later.
  • Figure 4: Empirical distribution of the minimizer $p(\mathbf{x}^\star)$ of a black-box function according to the posterior distribution of the Gaussian process conditioned on previous observations. Each sample is a path drawn from the Gaussian process model, which is then optimized to obtain a sample of the minimizer of the problem given the current information.
  • Figure 5: Visual example of the distribution of the optimum value of the problem, $p(f^\star|\mathcal{D})$, given the information provided by the predictive distribution of the Gaussian process conditioned on the observations.
  • ...and 3 more figures
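
The quantities shown in Figures 2 and 3 can be reproduced numerically. The sketch below computes the Shannon information content $-\log_2 p$ of single events and the entropy of a discrete random variable as the expected information content. The two probability mass functions are hypothetical values chosen for illustration and are not the ones plotted in the paper.

```python
import math


def information_content(p):
    """Shannon information content (surprise) of an event, in bits."""
    return -math.log2(p)


def entropy(probs):
    """Entropy in bits: the expected surprise under the distribution."""
    return sum(p * information_content(p) for p in probs if p > 0)


# Hypothetical probability mass functions over five events.
spread_out = [0.2, 0.2, 0.2, 0.2, 0.2]   # no outcome is favored
peaked = [0.8, 0.05, 0.05, 0.05, 0.05]   # one outcome dominates

# Rare events carry more surprise than likely ones.
print([round(information_content(p), 3) for p in peaked])

# The spread-out variable has the higher entropy of the two.
print(entropy(spread_out), entropy(peaked))
```

The same idea carries over to the distributions of Figures 4 and 5: a sharply peaked distribution over the optimum has low entropy, so an evaluation that is expected to sharpen that distribution is an informative one.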