EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions

Laurens Bliek, Arthur Guijt, Rickard Karlsson, Sicco Verwer, Mathijs de Weerdt

TL;DR

EXPObench addresses the lack of standardised benchmarking for surrogate-based optimisation on expensive black-box functions by introducing a public benchmark library that evaluates six algorithms on four real-world problems (Windwake, Pitzdaily, ESP, HPO). It provides a coherent experimental framework, a public dataset of evaluation points and runtimes, and practical rules of thumb for algorithm selection. The study reveals that exploration and objective evaluation time often dominate algorithm performance, sometimes outweighing surrogate model accuracy, and highlights cross-domain effectiveness of discrete models on continuous problems. This work enables more uniform benchmarking, accelerates method development with a reusable data resource, and guides practitioners toward informed algorithm choices under different cost and budget constraints.

Abstract

Surrogate algorithms such as Bayesian optimisation are designed specifically for black-box optimisation problems with expensive objectives, such as hyperparameter tuning or simulation-based optimisation. In the literature, these algorithms are usually evaluated on synthetic benchmarks, which are well established but have no expensive objective, and on only one or two real-life applications, which vary widely between papers. There is a clear lack of standardisation when it comes to benchmarking surrogate algorithms on real-life, expensive, black-box objective functions. This makes it very difficult to draw conclusions on the effect of algorithmic contributions and to give substantial advice on which method to use when. A new benchmark library, EXPObench, provides first steps towards such standardisation. The library is used to provide an extensive comparison of six different surrogate algorithms on four expensive optimisation problems from different real-life applications. This has led to new insights regarding the relative importance of exploration, the evaluation time of the objective, and the model used. We also provide rules of thumb for which surrogate algorithm to use in which situation. A further contribution is that we make the algorithms and benchmark problem instances publicly available, contributing to more uniform analysis of surrogate algorithms. Most importantly, we include the performance of the six algorithms on all evaluated problem instances. This results in a unique new dataset that lowers the bar for researching new methods, as the number of expensive evaluations required for comparison is significantly reduced.

Paper Structure

This paper contains 31 sections, 4 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Results on the different benchmark problems, averaged over $T$ runs, after starting with $R$ random samples. $T$ is varied due to the different degrees of expensiveness of the problems. The shaded area indicates one standard deviation, the horizontal axis indicates the iteration of the algorithm, and all figures use the legend shown in the middle. The computation time on the right does not contain the time it takes to evaluate the objective. The benchmark problems are: (a) wind farm layout optimisation, $10$ continuous variables, $T=10$, $R=20$; (b) Pitzdaily, $10$ continuous variables, $T=5$, $R=20$; (c) electrostatic precipitator, $49$ discrete variables, $T=7$, $R=24$; (d) simultaneous hyperparameter tuning and preprocessing for XGBoost, $117$ categorical, $7$ integer, $11$ continuous variables, $T=10$, $R=300$.
  • Figure 2: The best surrogate algorithm for the case that the evaluation time of the objective is artificially changed (vertical axis), and for different time budgets (horizontal axis). The different marker shapes indicate which of the surrogate algorithms achieved the best objective value, while the colour shows the corresponding objective value (not normalised, lower is better). The regions divided by black lines show which algorithm would perform best according to a decision tree trained on the data. Other black-box optimisation algorithms such as population-based methods are expected to dominate the empty bottom right region, where the time budget is large but the function evaluation time is small.
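
The aggregation described in the Figure 1 caption (a curve averaged over $T$ runs with a band of one standard deviation) could be sketched as follows. This is a minimal illustration, not code from EXPObench: the function names and the example data are hypothetical, and it assumes the plotted quantity is the best objective value found so far, with lower being better, and that the population standard deviation is used.

```python
from statistics import mean, pstdev


def best_so_far(values):
    """Running minimum of one run's objective values (assumes lower is better)."""
    best = float("inf")
    curve = []
    for v in values:
        best = min(best, v)
        curve.append(best)
    return curve


def aggregate_runs(runs):
    """Per-iteration mean and standard deviation of best-so-far curves over T runs."""
    curves = [best_so_far(r) for r in runs]
    n_iter = min(len(c) for c in curves)  # truncate to the shortest run
    means = [mean(c[i] for c in curves) for i in range(n_iter)]
    # Population standard deviation is an assumption; the caption does not specify.
    stds = [pstdev([c[i] for c in curves]) for i in range(n_iter)]
    return means, stds


# Hypothetical data: T = 3 runs, 4 iterations each.
runs = [
    [5.0, 3.0, 4.0, 1.0],
    [6.0, 6.5, 2.0, 2.5],
    [4.0, 5.0, 3.0, 3.5],
]
means, stds = aggregate_runs(runs)
```

The shaded area in such a plot would then span `means[i] - stds[i]` to `means[i] + stds[i]` at each iteration `i`.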