Mind the Gap: Measuring Generalization Performance Across Multiple Objectives

Matthias Feurer, Katharina Eggensperger, Edward Bergman, Florian Pfisterer, Bernd Bischl, Frank Hutter

TL;DR

Mind the Gap addresses how to measure generalization of multi-objective hyperparameter optimization beyond the validation set. It introduces optimistic and pessimistic Pareto fronts, defined via test-set dominance $\prec_{test}$ on the validation-derived front, and uses the resulting hypervolume gap as a robustness metric. The paper formalizes the evaluation protocol, demonstrates the existence of the approximation gap in experiments, and shows that these notions enable reliable comparisons between two MHPO algorithms. This framework provides a practical tool for robust multi-objective model selection and generalization assessment across domains such as NAS and AutoML.

Abstract

Modern machine learning models are often constructed taking into account multiple objectives, e.g., minimizing inference time while also maximizing accuracy. Multi-objective hyperparameter optimization (MHPO) algorithms return such candidate models, and the approximation of the Pareto front is used to assess their performance. In practice, we also want to measure generalization when moving from the validation to the test set. However, some of the models might no longer be Pareto-optimal, which makes it unclear how to quantify the performance of the MHPO method when evaluated on the test set. To resolve this, we provide a novel evaluation protocol that allows measuring the generalization performance of MHPO methods and studying its capabilities for comparing two optimization experiments.
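To make the protocol concrete, the following is a minimal sketch of how a hypervolume gap between an optimistic and a pessimistic front could be computed for two minimized objectives. It rests on assumptions that the summary above does not spell out: the optimistic set is taken to be the members of the validation Pareto set that are non-dominated on the test set, and the pessimistic set is taken to be the members that dominate no other member on the test set. The function names, the reference point, and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def dominates(a, b):
    """True if a dominates b (minimization): no worse in all objectives, better in one."""
    return bool(np.all(a <= b) and np.any(a < b))


def non_dominated_mask(costs):
    """Mask of points that are not dominated by any other point."""
    n = len(costs)
    return np.array(
        [not any(dominates(costs[j], costs[i]) for j in range(n) if j != i)
         for i in range(n)],
        dtype=bool,
    )


def non_dominating_mask(costs):
    """Mask of points that dominate no other point (assumed pessimistic set)."""
    n = len(costs)
    return np.array(
        [not any(dominates(costs[i], costs[j]) for j in range(n) if j != i)
         for i in range(n)],
        dtype=bool,
    )


def hypervolume_2d(points, ref):
    """Area dominated by the points and bounded by the reference point (2 objectives)."""
    front = points[non_dominated_mask(points)]
    front = front[np.argsort(front[:, 0])]  # ascending in the first objective
    hv, prev_y = 0.0, ref[1]
    for x, y in front:
        if x >= ref[0] or y >= prev_y:
            continue  # contributes no new area inside the reference box
        hv += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return hv


def approximation_gap(val_costs, test_costs, ref):
    """Return (optimistic HV, pessimistic HV, gap) under the assumptions stated above."""
    val_pareto = non_dominated_mask(val_costs)  # selection uses validation data only
    test_of_selected = test_costs[val_pareto]   # re-evaluate the selected configs on test
    optimistic = test_of_selected[non_dominated_mask(test_of_selected)]
    pessimistic = test_of_selected[non_dominating_mask(test_of_selected)]
    hv_opt = hypervolume_2d(optimistic, ref)
    hv_pess = hypervolume_2d(pessimistic, ref)
    return hv_opt, hv_pess, hv_opt - hv_pess


# Toy data (hypothetical): columns are (error, normalized inference time).
val_costs = np.array([[0.10, 0.90], [0.20, 0.60], [0.40, 0.40],
                      [0.70, 0.20], [0.50, 0.80]])
test_costs = np.array([[0.15, 0.90], [0.30, 0.70], [0.25, 0.50],
                       [0.70, 0.30], [0.10, 0.10]])
hv_opt, hv_pess, gap = approximation_gap(val_costs, test_costs, ref=(1.0, 1.0))
print(f"optimistic HV={hv_opt:.3f}, pessimistic HV={hv_pess:.3f}, gap={gap:.3f}")
```

In this toy example the last configuration is excellent on the test set but is never considered, because it is dominated on the validation set. This mirrors the setting in which selection may only use validation data. Of the four selected configurations, one becomes dominated on the test set, which is what opens the gap between the two hypervolumes.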

Paper Structure (10 sections, 3 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: We visualize validation (orange) and test performance (green) of the Pareto set, as found on the validation data set. Considering test performance, (a) all configurations are non-dominated, (b) the configurations are still Pareto-optimal, but switch order, and (c) the configurations are no longer Pareto-optimal.
  • Figure 2: We visualize validation (orange) and test performance (green/pink) of the Pareto set, as found on the validation data set. Left: We show that ignoring dominated points on the test set leads to an overestimation of the hypervolume indicator. Middle: We show how adversarial MHPO can return points that lead to an increased hypervolume on the test data. Right: We show our proposed optimistic Pareto-set (green), pessimistic Pareto-set (pink), and the approximation gap between the optimistic and pessimistic Pareto-set (pink area).
  • Figure 3: Precision vs Recall. The left plot focuses on the validation error, the middle plot depicts the test error of points from the Pareto set on the validation set, and the right-hand-side plot depicts the approximation of the optimistic and the pessimistic Pareto sets.
  • Figure 4: Optimistic and pessimistic Pareto fronts for the random forest (left), the linear model (middle), and both (right) after 200 iterations of random search. For both models, we plot the pessimistic Pareto front in a darker color with circle markers and the optimistic Pareto front in a lighter color with star markers; we use the same colors in the plot on the right-hand side. Furthermore, in the left and middle plots, we also give the validation Pareto front in light orange (similar to Figure \ref{fig:experiment}).