Generalization Ability of Feature-based Performance Prediction Models: A Statistical Analysis across Benchmarks

Ana Nikolikj; Ana Kostovska; Gjorgjina Cenikj; Carola Doerr; Tome Eftimov

TL;DR

This work addresses how well feature-based performance predictors generalize across benchmark suites by introducing a statistical framework that preserves high-dimensional information. It maps problem instances to a shared $n$-dimensional meta-feature space and uses the multivariate $\mathcal{E}$ test to compare the training and testing distributions, linking cross-suite similarity to predictive transfer. Two experiments, one with the standard BBOB and CEC suites and another with affine recombinations of BBOB problems, show that when the landscape-feature distributions are not statistically different, cross-suite prediction errors stay within the range of the training errors, whereas significant differences forecast degraded accuracy. The study contributes a principled, information-preserving method for anticipating the transferability of performance predictors, and it highlights the potential of combining this statistical view with traditional empirical coverage analyses to guide feature design and benchmark selection.

Abstract

This study examines how well algorithm performance prediction models generalize across benchmark suites. We compare the statistical similarity between problem collections with the accuracy of performance prediction models based on exploratory landscape analysis features, and we observe a positive correlation between the two. Specifically, when the high-dimensional feature value distributions of the training and testing suites do not differ significantly, the model tends to generalize well, in the sense that the testing errors are in the same range as the training errors. Two experiments validate these findings: one on the standard BBOB and CEC benchmark suites, and another on five collections of affine combinations of BBOB problem instances.
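To make the comparison step concrete, the following is a minimal sketch of a two-sample multivariate $\mathcal{E}$ (energy) test with a permutation p-value, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`energy_statistic`, `energy_test`), the number of permutations, and the synthetic 3-dimensional "feature vectors" are all hypothetical, and the paper's actual feature set and test settings are not given in this summary.

```python
import math
import random


def _mean_dist(a, b):
    """Mean Euclidean distance over all pairs (x, y) with x in a and y in b."""
    total = sum(math.dist(x, y) for x in a for y in b)
    return total / (len(a) * len(b))


def energy_statistic(a, b):
    """Two-sample energy statistic: scaled between-sample distance
    minus the two within-sample distances."""
    n, m = len(a), len(b)
    e = 2.0 * _mean_dist(a, b) - _mean_dist(a, a) - _mean_dist(b, b)
    return n * m / (n + m) * e


def energy_test(a, b, n_perm=199, seed=0):
    """Permutation test: reshuffle the pooled sample and count how often
    the permuted statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = energy_statistic(a, b)
    pooled = list(a) + list(b)
    n = len(a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # The +1 terms count the observed split as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)


def gaussian_sample(rng, size, dim, shift=0.0):
    """Hypothetical stand-in for per-instance feature vectors of one suite."""
    return [tuple(rng.gauss(shift, 1.0) for _ in range(dim)) for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(42)
    train_suite = gaussian_sample(rng, 15, 3)
    similar_suite = gaussian_sample(rng, 15, 3)
    shifted_suite = gaussian_sample(rng, 15, 3, shift=10.0)
    print("similar:", energy_test(train_suite, similar_suite))
    print("shifted:", energy_test(train_suite, shifted_suite))
```

In this reading, a large p-value for a pair of suites means the test found no significant difference between their feature distributions, which is the situation where the abstract reports test errors staying in the range of the training errors. A small p-value flags a pair for which degraded accuracy would be expected.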

Paper Structure (9 sections, 1 equation, 4 figures, 4 tables)

This paper contains 9 sections, 1 equation, 4 figures, 4 tables.

Introduction
Related work
Statistical measure for assessing similarity of benchmark suites
Experimental design
Results and discussion
First experiment
Second experiment
Discussion
Conclusions

Figures (4)

Figure 1: Heatmap showing the MDAE of an RF model when predicting the performance of a) CMA, b) DE, and c) PSO on BBOB, CEC2013, CEC2014, CEC2015, and CEC2017. Rows indicate the training benchmark suite and columns indicate the benchmark suite on which the model was evaluated.
Figure 2: Box plots showing the absolute error (AE) of an RF model when predicting the performance of a-d) CMA, e-h) DE, and i-l) PSO. Each subplot title names the training benchmark suite; one box plot shows the training AEs and the others show the corresponding test AEs.
Figure 3: Heatmaps of the p-values obtained by comparing an algorithm's performance distributions between pairs of benchmark suites for a) CMA, b) DE, and c) PSO. Rows and columns denote the benchmark suites in each paired comparison. Because the comparison is symmetric, the heatmaps show only the upper triangle. A p-value below 0.05 in the two-sample Kolmogorov-Smirnov test indicates a significant difference in algorithm performance between the two benchmark suites.
Figure 4: Heatmap showing the MDAE of an RF model when predicting the performance of a) CMA, b) DE, and c) PSO on the benchmark suites sampled from the affine problems. Rows indicate the training benchmark suite and columns indicate the benchmark suite on which the model was evaluated.
