A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough?

Moslem Noori, Elisabetta Valiante, Thomas Van Vaerenbergh, Masoud Mohseni, Ignacio Rozada

TL;DR

The paper addresses per-instance benchmarking of stochastic optimizers by rigorously modeling the uncertainty in repeat-based metrics. It links the per-run success probability $p$ to the target metrics $R_{99}$ and $ETT_{99}$, and introduces an adaptive-repeat strategy to achieve a user-specified accuracy. Key contributions include deriving a lower bound on the number of repeats required to bound the error on $p$, analyzing how CI methods impact $R_{99}$ and $ETT_{99}$, and validating the framework with data-driven simulations. The results enable reliable benchmarking and informed hyperparameter tuning for stochastic optimizers, preventing premature or overconfident conclusions about comparative performance.
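The link between $p$ and $R_{99}$ can be sketched as follows, assuming the commonly used definition $R_{99} = \ln(1 - 0.99) / \ln(1 - p)$, i.e. the number of independent repeats needed to observe at least one success with 99% probability. The function names `r99` and `r99_interval` are illustrative and not taken from the paper. Because $R_{99}$ decreases monotonically in $p$, an interval on $p$ maps to an interval on $R_{99}$ by swapping the endpoints.

```python
import math


def r99(p: float, confidence: float = 0.99) -> float:
    """Repeats needed to see at least one success with the given confidence.

    Assumes independent runs with per-run success probability p in (0, 1).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(1.0 - confidence) / math.log(1.0 - p)


def r99_interval(p_low: float, p_high: float) -> tuple:
    """Map an interval on p to an interval on R99.

    R99 is decreasing in p, so the upper end of the p interval
    gives the lower end of the R99 interval, and vice versa.
    """
    return r99(p_high), r99(p_low)


# Example: an estimate of p = 0.10 with an error margin of 0.03.
low, high = r99_interval(0.10 - 0.03, 0.10 + 0.03)
print(f"R99 at p=0.10: {r99(0.10):.1f}, interval: [{low:.1f}, {high:.1f}]")
```

The resulting interval is asymmetric around the point estimate and widens quickly as $p$ approaches zero, which is why a fixed error margin on $p$ can translate into a large uncertainty on $R_{99}$ for hard instances.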

Abstract

A key trait of stochastic optimizers is that multiple runs of the same optimizer on the same problem can produce different results. As a result, their performance is evaluated over several repeats, or runs, on the problem. However, the accuracy of the estimated performance metrics depends on the number of runs and should be studied using statistical tools. We present a statistical analysis of the common metrics, and develop guidelines for experiment design to measure the optimizer's performance using these metrics to a high level of confidence and accuracy. To this end, we first discuss the confidence intervals of the metrics and how they relate to the number of runs of an experiment. We then derive a lower bound on the number of repeats needed to guarantee a given accuracy in the metrics. Using this bound, we propose an algorithm that adaptively adjusts the number of repeats to ensure the accuracy of the evaluated metric. Our simulation results demonstrate the utility of our analysis, showing how it enables reliable benchmarking and hyperparameter tuning and prevents premature conclusions about the performance of stochastic optimizers.
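The idea of adapting the number of repeats to a target accuracy can be illustrated with a minimal sketch. This is not the paper's algorithm or its lower bound: it assumes a Wilson score interval for $p$ and simply keeps adding batches of runs until the interval half-width drops below a user-chosen margin `eps`. The names `run_once`, `batch`, and `max_runs` are hypothetical.

```python
import math
import random
from typing import Callable, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def adaptive_repeats(
    run_once: Callable[[], bool],
    eps: float,
    batch: int = 50,
    max_runs: int = 100_000,
) -> Tuple[float, int, Tuple[float, float]]:
    """Run in batches until the interval half-width on p is at most eps.

    run_once is a hypothetical callable returning True when a run succeeds.
    Returns the estimate of p, the number of runs used, and the final interval.
    """
    successes = 0
    n = 0
    low, high = 0.0, 1.0
    while n < max_runs:
        for _ in range(batch):
            successes += bool(run_once())
        n += batch
        low, high = wilson_interval(successes, n)
        if (high - low) / 2 <= eps:
            break
    return successes / n, n, (low, high)


# Simulated optimizer that succeeds with probability 0.1 on each run.
rng = random.Random(0)
p_hat, runs, ci = adaptive_repeats(lambda: rng.random() < 0.1, eps=0.03)
print(f"p_hat={p_hat:.3f} after {runs} runs, interval={ci}")
```

A stopping rule of this kind spends few runs on instances where the interval tightens quickly and more where it does not, rather than fixing one repeat count for every instance in advance.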

Paper Structure

This paper contains 8 sections, 15 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Confidence interval of $R_{99}$ across different success probabilities with an error margin of $\varepsilon = 0.03$. The right plot is a zoom of the left plot.
  • Figure 2: Confidence interval of $R_{99}$ across different success probabilities with an error margin of $\varepsilon = 0.01$. The right plot is a zoom of the left plot.
  • Figure 3: Examples of 5-95% confidence intervals (dashed vertical lines) of the success probability, calculated using Jeffreys intervals. We consider the cases of $\hat{p} \in \{0.1,0.5,0.9\}$ (columns) and $N \in \{100, 1000, 1000\}$ (rows).
  • Figure 4: Examples of sampled success probability with $N=100$ (green histogram), compared with its expected distribution (red line) and 5-95% confidence intervals (dashed vertical lines). The KS test cannot reject the null hypothesis that the data are drawn from the Beta distributions (p-value $> 0.05$) for all values of $p$.
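
A Jeffreys interval of the kind referenced in the Figure 3 and Figure 4 captions is given by quantiles of a Beta$(k + 1/2,\, N - k + 1/2)$ distribution, where $k$ is the number of successes in $N$ runs. The sketch below approximates the 5% and 95% quantiles by sampling with the standard library, which keeps it dependency-free; an exact Beta quantile function would normally be preferable. The function name `jeffreys_interval_mc` and its parameters are illustrative, not from the paper.

```python
import random
from typing import Tuple


def jeffreys_interval_mc(
    successes: int,
    n: int,
    lower_q: float = 0.05,
    upper_q: float = 0.95,
    draws: int = 20_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Approximate Jeffreys interval via Monte Carlo quantiles.

    Samples from Beta(successes + 0.5, n - successes + 0.5) and reads off
    the empirical lower_q and upper_q quantiles.
    """
    rng = random.Random(seed)
    a = successes + 0.5
    b = n - successes + 0.5
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    low = samples[int(lower_q * (draws - 1))]
    high = samples[int(upper_q * (draws - 1))]
    return low, high


# Same observed success rate (0.1) at two sample sizes.
for n_runs in (100, 1000):
    low, high = jeffreys_interval_mc(successes=n_runs // 10, n=n_runs)
    print(f"N={n_runs}: 5-95% interval on p is [{low:.3f}, {high:.3f}]")
```

Comparing the two sample sizes shows the interval narrowing as $N$ grows, which is the effect the rows of Figure 3 illustrate.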