Large-scale Benchmarking of Metaphor-based Optimization Heuristics

Diederick Vermetten, Carola Doerr, Hao Wang, Anna V. Kononova, Thomas Bäck

TL;DR

This study addresses the challenge of evaluating a rapidly growing set of metaphor-based optimization heuristics by benchmarking 294 implementations on the BBOB function suite across multiple dimensions. Using both anytime (AOCC) and fixed-budget (AOCClarge) performance metrics, the authors analyze how experimental design choices, such as budget and baselines, influence comparisons, and they reveal substantial variability and complementarity within the algorithm portfolio. Key contributions include a comprehensive performance landscape, a Shapley-value assessment of algorithm contributions, and insights into how benchmarking data can guide algorithm selection, hybridization, and future research, with an emphasis on reproducibility throughout. The work highlights that many methods do not outperform random search on easy problems, and it demonstrates practical tools and methodologies for fair, data-driven evaluation of optimization heuristics.

Abstract

The number of proposed iterative optimization heuristics is growing steadily, and with this growth, there have been many points of discussion within the wider community. One particular criticism that is raised towards many new algorithms is their focus on metaphors used to present the method, rather than emphasizing their potential algorithmic contributions. Several studies into popular metaphor-based algorithms have highlighted these problems, even showcasing algorithms that are functionally equivalent to older existing methods. Unfortunately, this detailed approach is not scalable to the whole set of metaphor-based algorithms. Because of this, we investigate ways in which benchmarking can shed light on these algorithms. To this end, we run a set of 294 algorithm implementations on the BBOB function suite. We investigate how the choice of the budget, the performance measure, or other aspects of experimental design impact the comparison of these algorithms. Our results emphasize why benchmarking is a key step in expanding our understanding of the algorithm space, and what challenges still need to be overcome to fully gauge the potential improvements to the state-of-the-art hiding behind the metaphors.

Paper Structure (7 sections, 1 equation, 8 figures)

Figures (8)

  • Figure 1: Cumulative distribution of AOCC (default bounds): the fraction of the 294 algorithms in the portfolio whose AOCC is worse than the $x$-axis value. For example, in dimension 2, the star indicates that the fraction of algorithms with an AOCC below 0.6 is 0.65. AOCC values shown are aggregated over all 24 BBOB functions.
  • Figure 2: Normalized AOCC values per function for all 294 algorithms, ordered from dimensionality 2 (left) to 5, 10 and 20 (right). Color scales from dark blue=0 (worst) to yellow=1 (best). Larger versions of these figures are available on our Figshare repository reproducibility_and_figures. Colors denote the algorithm's library and algorithms are sorted by total AOCC over all functions.
  • Figure 3: Comparison of each algorithm's performance to RandomSearch based on AOCC. An algorithm is considered worse on a function if its AOCC is at least 10% less than that of RandomSearch.
  • Figure 4: Left: Number of functions (out of 24) on which each algorithm is in the top 3 (on average AOCC). Middle: Average loss (absolute difference to best AOCC per function) over all 24 BBOB functions for each algorithm. Right: Contribution to an algorithm portfolio consisting of all baselines. Only algorithms which are considered competitive on at least one function are included, for a total of 43 unique algorithms.
  • Figure 5: Best algorithm for each (budget, function) combination in dimensionality 10, based on average precision with a cutoff at $10^{-8}$. Light cells with brown text indicate a tie, with the number in the cell indicating how many algorithms are tied. The number in brackets after the algorithm name indicates how often it occurs in the figure.
  • ...and 3 more figures
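
The quantities named in the captions of Figures 1, 3 and 4 can be illustrated with a small sketch. It is not the authors' code: the AOCC values, the algorithm names `AlgA` and `AlgB`, and all function names are hypothetical, and the "at least 10% less" rule is read here as a relative margin (AOCC no larger than 0.9 times the RandomSearch AOCC), which is an assumption.

```python
from typing import Dict, List

# Hypothetical AOCC table: algorithm name -> one score per function (higher is better).
# The paper uses 294 algorithms and 24 BBOB functions; three of each keeps this readable.
AoccTable = Dict[str, List[float]]

TABLE: AoccTable = {
    "RandomSearch": [0.50, 0.40, 0.20],
    "AlgA": [0.80, 0.30, 0.20],
    "AlgB": [0.47, 0.41, 0.10],
}


def fraction_below(values: List[float], threshold: float) -> float:
    """Fraction of values strictly below `threshold` (one point of a Figure 1 style curve)."""
    return sum(v < threshold for v in values) / len(values)


def functions_worse_than_baseline(
    table: AoccTable, name: str, baseline: str = "RandomSearch", margin: float = 0.10
) -> int:
    """Count functions on which `name` is at least `margin` (relative) below the baseline.

    Assumption: "10% less" is interpreted relative to the baseline's AOCC.
    """
    return sum(a <= (1.0 - margin) * b for a, b in zip(table[name], table[baseline]))


def average_loss(table: AoccTable, name: str) -> float:
    """Mean absolute gap between `name` and the best AOCC on each function."""
    n_functions = len(table[name])
    best = [max(scores[i] for scores in table.values()) for i in range(n_functions)]
    return sum(b - a for a, b in zip(table[name], best)) / n_functions


if __name__ == "__main__":
    # Aggregate each algorithm over all functions, as in Figure 1, then query the curve.
    mean_aocc = [sum(scores) / len(scores) for scores in TABLE.values()]
    print("fraction of algorithms with mean AOCC below 0.4:", fraction_below(mean_aocc, 0.4))
    for algorithm in ("AlgA", "AlgB"):
        print(
            algorithm,
            "worse than RandomSearch on",
            functions_worse_than_baseline(TABLE, algorithm),
            "function(s); average loss",
            round(average_loss(TABLE, algorithm), 4),
        )
```

Keeping the three measures as separate functions over one score table mirrors how the figures reuse the same AOCC data for different views; with real data, the table would be filled from the published benchmark results rather than hard-coded.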