A survey and benchmark of high-dimensional Bayesian optimization of discrete sequences

Miguel González-Duque, Richard Michael, Simon Bartels, Yevgen Zainchkovskyy, Søren Hauberg, Wouter Boomsma

TL;DR

This work develops a unified framework for testing a wide range of high-dimensional Bayesian optimization methods, together with a collection of standardized black-box functions representing real-world application domains in chemistry and biology.

Abstract

Optimizing discrete black-box functions is key in several domains, e.g. protein engineering and drug design. Due to the lack of gradient information and the need for sample efficiency, Bayesian optimization is an ideal candidate for these tasks. Several methods for high-dimensional continuous and categorical Bayesian optimization have been proposed recently. However, our survey of the field reveals highly heterogeneous experimental set-ups across methods and technical barriers for the replicability and application of published algorithms to real-world tasks. To address these issues, we develop a unified framework to test a vast array of high-dimensional Bayesian optimization methods and a collection of standardized black-box functions representing real-world application domains in chemistry and biology. These two components of the benchmark are each supported by flexible, scalable, and easily extendable software libraries (poli and poli-baselines), allowing practitioners to readily incorporate new optimization objectives or discrete optimizers. Project website: https://machinelearninglifescience.github.io/hdbo_benchmark

Paper Structure (41 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: A timeline of high-dimensional Bayesian optimization methods, with arrows drawn between methods that explicitly augment or use each other. References can be found in supplementary Table \ref{tab:appendix:full_taxonomy}. The figure is inspired by Justesen:DL4VGP:2020. https://machinelearninglifescience.github.io/hdbo_benchmark/docs/hdbo/introduction/
  • Figure 2: Existing BO methods are tested on problems whose effective dimensions are too low. This figure shows the sequence length and number of categories of the largest search space in each method's original tests. For reference, the discrete optimization problems usually tackled by practitioners in chemistry and biology are of the order of $10^2$ in sequence length and $>10^1$ in number of categories. Methods that optimize directly in discrete space (e.g. BODi, ProbRep, Bounce; Sec. \ref{sec:taxonomy:structured-spaces}) are tested on shorter sequence lengths and smaller dictionary sizes; methods that rely on unsupervised information (e.g. LaMBO; Sec. \ref{sec:taxonomy:non-linear-embeddings}) are able to optimize more complex problems, like protein engineering or small molecule optimization.
  • Figure 3: Initialization, evaluation budget, and number of replications using different seeds reported in the experimental set-ups of several HDBO methods. We see heterogeneity in the evaluation of optimizers.
  • Figure 5: Overview of the problem space (x-axis) and how the categories act on it. The black box ultimately maps discrete sequences of alphabet elements to a real value. BO methods can act on the original (discrete) space, on linear or non-linear mappings of it, or on selected variables of the input space or of a mapping of it. These continuous versions can also accommodate one-hot representations.
  • Figure 6: poli's isolation process for complex environments