Confidence Interval Estimation of Predictive Performance in the Context of AutoML

Konstantinos Paraschakis, Andrea Castellani, Giorgos Borboudakis, Ioannis Tsamardinos

TL;DR

This work addresses the challenge of quantifying uncertainty in AutoML-predicted out-of-sample performance by evaluating nine CI-estimation methods and introducing BBC-F, a fold-based variant of Bootstrap Bias Correction. BBC and BBC-F consistently yield accurate, tight confidence intervals and robust coverage, while BBC-F dramatically reduces computation compared to BBC. The study combines real OpenML datasets with controlled simulations to demonstrate superiority over competing methods, including NB and 10p variants. The findings provide practitioners with a reliable, efficient approach for reporting predictive performance uncertainty in AutoML pipelines, with potential for extension to other tasks and metrics.

Abstract

Any supervised machine learning analysis is required to provide an estimate of the out-of-sample predictive performance. However, it is imperative to also provide a quantification of the uncertainty of this performance in the form of a confidence or credible interval (CI) and not just a point estimate. In an AutoML setting, estimating the CI is challenging due to the "winner's curse", i.e., the bias of estimation due to cross-validating several machine learning pipelines and selecting the winning one. In this work, we perform a comparative evaluation of 9 state-of-the-art methods and variants in CI estimation in an AutoML setting on a corpus of real and simulated datasets. The methods are compared in terms of inclusion percentage (does a 95% CI include the true performance at least 95% of the time), CI tightness (tighter CIs are preferable as being more informative), and execution time. The evaluation is the first one that covers most, if not all, such methods and extends previous work to imbalanced and small-sample tasks. In addition, we present a variant, called BBC-F, of an existing method (the Bootstrap Bias Correction, or BBC) that maintains the statistical properties of the BBC but is more computationally efficient. The results support that BBC-F and BBC dominate the other methods in all metrics measured.

Paper Structure

This paper contains 6 sections, 2 equations, 3 figures, 3 tables, and 1 algorithm.

Figures (3)

  • Figure 1: A schematic depiction of the BBC-F algorithm. CVT leads to an out-of-sample prediction matrix $\Pi$. $\Pi$ is used to compute the matrix $\Pi'$, containing the performance $p_{ij}$ for each configuration $m_i$ on each test fold $f_j$. The winning configuration $J^*$ is used to create the final model $M$ of CVT. Next, $\Pi'$ is bootstrapped with respect to its rows. In each bootstrap iteration, the winning configuration $J^b$ is selected based on the average in-bag performances (it is shown as plain $J$ in the figure, since it appears as a subscript in several places). Its average out-of-bag performance $L_b$ is stored. The distribution of $\{L_1, \ldots, L_B\}$ is used to provide a point estimate (the mean of the distribution) and a CI.
  • Figure 2: Relative tightness ratio to BBC per dataset (closer to 1 is more similar to BBC). Blue/orange dots correspond to non-rejected/rejected inclusion percentages.
  • Figure 3: Time-complexity analysis for BBC-P and BBC-F with respect to (left) the number of data samples, (middle) the number of model configurations, and (right) the number of folds.
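
The resampling loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the per-fold performance matrix is stored with folds as rows and configurations as columns, that higher values mean better performance, and that the CI is taken as a percentile interval of the bootstrap distribution. The function name `bbc_f` and its parameters are hypothetical.

```python
import numpy as np


def bbc_f(fold_perf, n_boot=1000, alpha=0.05, seed=0):
    """Sketch of a fold-level bootstrap bias correction.

    fold_perf: array of shape (n_folds, n_configs); entry [j, i] is the
    performance of configuration i on test fold j (assumed layout,
    higher is better).
    Returns (point_estimate, (lower, upper)).
    """
    rng = np.random.default_rng(seed)
    fold_perf = np.asarray(fold_perf, dtype=float)
    n_folds = fold_perf.shape[0]
    all_folds = np.arange(n_folds)

    estimates = []
    for _ in range(n_boot):
        # Resample folds (rows) with replacement: the in-bag folds.
        in_bag = rng.integers(0, n_folds, size=n_folds)
        # Folds that were never drawn form the out-of-bag set.
        out_of_bag = np.setdiff1d(all_folds, in_bag)
        if out_of_bag.size == 0:
            continue  # rare resample containing every fold; skip it

        # Select the winner on the in-bag folds only.
        winner = np.argmax(fold_perf[in_bag].mean(axis=0))
        # Score the winner on folds it was not selected on.
        estimates.append(fold_perf[out_of_bag, winner].mean())

    estimates = np.asarray(estimates)
    # Percentile interval (an assumption of this sketch).
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return estimates.mean(), (lower, upper)


if __name__ == "__main__":
    # Synthetic example: 10 folds, 20 configurations with noisy scores.
    demo_rng = np.random.default_rng(42)
    demo_perf = demo_rng.uniform(0.6, 0.9, size=(10, 20))
    point, (lo, hi) = bbc_f(demo_perf)
    print(f"estimate={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Because the winner is selected on the in-bag folds and scored only on the out-of-bag folds, each $L_b$ avoids reusing the folds that drove the selection, which is the source of the "winner's curse" bias. Working on the fold-level matrix rather than on individual predictions keeps each bootstrap iteration cheap.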