Alpha Entropy Search for New Information-based Bayesian Optimization

Daniel Fernández-Sánchez, Eduardo C. Garrido-Merchán, Daniel Hernández-Lobato

TL;DR

This work provides an implementation of AES in BoTorch and evaluates it in synthetic, benchmark, and real-world experiments involving the tuning of the hyper-parameters of a deep neural network. The results show that AES is competitive with other information-based acquisition functions such as JES, MES, or PES.

Abstract

Bayesian optimization (BO) methods based on information theory have obtained state-of-the-art results in several tasks. These techniques rely heavily on the Kullback-Leibler (KL) divergence to compute the acquisition function. In this work, we introduce a novel information-based class of acquisition functions for BO called Alpha Entropy Search (AES). AES is based on the α-divergence, which generalizes the KL divergence. At each iteration, AES selects the next evaluation point as the one whose associated target value has the highest level of dependency with respect to the location and associated value of the global maximum of the optimization problem. Dependency is measured in terms of the α-divergence, as an alternative to the KL divergence. Intuitively, this favors evaluating the objective function at the points that are most informative about the global maximum. The α-divergence has a free parameter α, which determines the behavior of the divergence by trading off between evaluating differences between distributions at a single mode and evaluating differences globally. Therefore, different values of α result in different acquisition functions. The AES acquisition function lacks a closed-form expression. However, we propose an efficient and accurate approximation using a truncated Gaussian distribution. In practice, the value of α can be chosen by the practitioner, but here we suggest using a combination of acquisition functions obtained by simultaneously considering a range of values of α. We provide an implementation of AES in BoTorch and evaluate its performance in synthetic, benchmark, and real-world experiments involving the tuning of the hyper-parameters of a deep neural network. These experiments show that the performance of AES is competitive with respect to other information-based acquisition functions such as JES, MES, or PES.
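To make the role of α concrete, the following is a minimal sketch of the α-divergence between two discrete distributions. It assumes the parametrization used by Minka (2005), $D_\alpha(p \| q) = \frac{1}{\alpha(1-\alpha)} \sum_i \left( \alpha p_i + (1-\alpha) q_i - p_i^{\alpha} q_i^{1-\alpha} \right)$, which may differ from the exact form used in the paper. The function name `alpha_divergence` and the example distributions are illustrative and not taken from the authors' BoTorch implementation. The sketch only illustrates how the divergence recovers the KL divergence as α approaches 1; it is not the AES acquisition function itself.

```python
import math


def alpha_divergence(p, q, alpha):
    """Alpha-divergence between discrete distributions p and q.

    Assumes Minka's (2005) parametrization and strictly positive entries.
    alpha -> 1 recovers KL(p || q); alpha -> 0 recovers KL(q || p).
    """
    if math.isclose(alpha, 1.0, abs_tol=1e-12):
        # Limiting case: KL(p || q)
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    if math.isclose(alpha, 0.0, abs_tol=1e-12):
        # Limiting case: KL(q || p)
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    total = sum(
        alpha * pi + (1.0 - alpha) * qi - pi**alpha * qi ** (1.0 - alpha)
        for pi, qi in zip(p, q)
    )
    return total / (alpha * (1.0 - alpha))


# Hypothetical example distributions (not from the paper).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

for a in (-2.0, 0.0, 0.5, 0.999, 1.0, 3.0):
    print(f"alpha={a:6.3f}  D_alpha(p||q)={alpha_divergence(p, q, a):.5f}")
```

Evaluating the same pair of distributions over several values of α shows that each value weights their mismatch differently, which is the reason different values of α lead to different acquisition functions.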

Paper Structure

This paper contains 20 sections, 29 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: GP fit of the objective function (top of each image) and the associated acquisition function (Expected Improvement garnett2023bayesian) built using the predictive distribution of the GP (bottom). Black points are observations, and the red point is the last evaluation performed, given by the maximizer of the acquisition function in the previous iteration. The panels show the BO process as iterations are carried out (from t=3 to t=5) and how the GP and the associated acquisition function guide the search for the optimum.
  • Figure 2: The Gaussian distribution $q(x)$ is fitted to $p(x)$ by minimizing Amari’s divergence with different values of $\alpha$. When $\alpha \rightarrow -\infty$, $q(x)$ tries to match one mode of $p(x)$, and as $\alpha$ increases, $q(x)$ starts covering more of the entire distribution. Finally, when $\alpha \rightarrow \infty$, $q(x)$ covers $p(x)$ entirely. Reproduced from minka2005divergence.
  • Figure 3: (bottom) Comparison of AES for different $\alpha$ values, JES and the ensemble acquisition function. We also display the maximum of each acquisition function. (top) Predictive distribution of the GP and generated samples of $\{\mathbf{x}^\star, y^\star\}$. The acquisition functions have been normalized so that the maximum is equal to one for a better visualization. Best viewed in color.
  • Figure 4: (top-left) GP predictive distribution for the objective. From (top-right) to (bottom-left) acquisition function AES, when using the proposed approximation, and using a method that is expected to give the exact acquisition. We report results for a representative set of $\alpha$ values. (bottom-right) Acquisition function for the ensemble method using the proposed approximation and a method that is expected to give the exact acquisition. Best viewed in color.
  • Figure 5: Average logarithm of the relative difference between the objective at each method's recommendation and the objective at the global maximum, as a function of the number of evaluations. Results are shown for the 4-, 6-, 8-, and 12-dimensional problems. Observations are noiseless. Best viewed in color.
  • ...and 7 more figures