Quantifying Uncertainty and Variability in Machine Learning: Confidence Intervals for Quantiles in Performance Metric Distributions

Christoph Lehmann, Yahor Paromau

TL;DR

The paper reframes ML model evaluation as a study of performance metric distributions rather than single-point estimates, driven by sources of variation such as seed initialization and hyperparameter search. It advocates quantile-based analysis and confidence intervals for quantiles to quantify uncertainty, offering three nonparametric CI methods (exact, asymptotic, semiparametric bootstrap) plus a $t$-interval benchmark for the mean. Through simulations and real-data use cases (classification and regression), it shows that nonparametric exact CIs deliver strong coverage, middle quantiles are estimated more reliably, and semiparametric bootstrap can assist in very small samples though tail behavior may be challenging; practical guidance suggests $n\in[15,25]$ suffices for quantiles up to 90%. Overall, the framework enables robust, distribution-aware comparisons of training configurations and can inform data quality improvements and decision-making under uncertainty in ML pipelines.

Abstract

Machine learning models are widely used in applications where reliability and robustness are critical. Model evaluation often relies on single-point estimates of performance metrics, such as accuracy, F1 score, or mean squared error, that fail to capture the inherent variability in model performance. This variability arises from multiple sources, including the train-test split, weight initialization, and hyperparameter tuning. Investigating the characteristics of performance metric distributions, rather than focusing on a single point only, is essential for informed decision-making during model selection and optimization, especially in high-stakes settings. How does the performance metric vary due to intrinsic uncertainty in the selected modeling approach, for example when the train-test split is modified, the initial weights for optimization are changed, or hyperparameter tuning is done with a probabilistic algorithm? This question shifts the focus from identifying a single best model to understanding a distribution of the performance metric that captures variability across different training conditions. By running multiple experiments with varied settings, empirical distributions of performance metrics can be generated. Analyzing these distributions can lead to more robust models that generalize well across diverse scenarios. This contribution explores the use of quantiles and confidence intervals to analyze such distributions, providing a more complete understanding of model performance and its uncertainty. Aimed at a statistically interested audience within the machine learning community, the suggested approaches are easy to implement and apply to various performance metrics for classification and regression problems. Given the often long training times in ML, particular attention is given to small sample sizes (on the order of 10-25).
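To make the idea of a confidence interval for a quantile concrete, the following is a minimal sketch of the textbook distribution-free interval built from order statistics. It is not taken from the paper and may differ from the authors' exact method. It relies on the fact that, for a continuous distribution, the number of observations below the true $p$-quantile is Binomial($n$, $p$), so the interval $[X_{(l)}, X_{(u)}]$ has coverage $P(l \le B \le u-1)$. The function names `binom_pmf` and `exact_quantile_ci` and the accuracy values are hypothetical, chosen only for illustration.

```python
from math import comb


def binom_pmf(n, k, p):
    """Probability that a Binomial(n, p) variable equals k."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_quantile_ci(sample, p, alpha=0.05):
    """Distribution-free CI for the p-quantile from order statistics.

    Searches all pairs of order statistics (l, u), 1-indexed, and returns
    (lower, upper, coverage) for the pair with the smallest index span whose
    coverage P(l <= B <= u - 1), B ~ Binomial(n, p), is at least 1 - alpha.
    Returns None if no pair reaches the requested level (illustrative choice).
    """
    x = sorted(sample)
    n = len(x)
    pmf = [binom_pmf(n, k, p) for k in range(n + 1)]
    best = None
    for l in range(1, n):
        for u in range(l + 1, n + 1):
            coverage = sum(pmf[l:u])  # P(l <= B <= u - 1)
            if coverage >= 1 - alpha:
                key = (u - l, -coverage)  # shortest span, then highest coverage
                if best is None or key < best[0]:
                    best = (key, l, u, coverage)
    if best is None:
        return None
    _, l, u, coverage = best
    return x[l - 1], x[u - 1], coverage


# Hypothetical accuracies from n = 10 seed-controlled training runs.
accuracies = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.77, 0.86]
median_ci = exact_quantile_ci(accuracies, p=0.5, alpha=0.05)
tail_ci = exact_quantile_ci(accuracies, p=0.9, alpha=0.05)
print("median CI:", median_ci)
print("90% quantile CI:", tail_ci)
```

With only ten runs, the sketch finds an interval for the median but returns `None` for the 90% quantile at the 95% level, because no pair of order statistics reaches the required coverage. This illustrates why tail quantiles are harder to bound than middle quantiles at small sample sizes.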

Paper Structure

This paper contains 12 sections, 10 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Examples of empirical distributions of tmoi based on approx. 1000 seed-controlled training runs.
  • Figure 2: Densities of all distributions considered in the simulation. Red dashed line: interdecile range (10%, 90%).
  • Figure 3: Average modulus relative bias of different quantile point estimators: sample quantile ($\hat{Q}$), interpolated quantile ($\hat{Q}_L$), bootstrap median. Points indicate the corresponding relative RMSE. The black dashed line marks the 5% threshold. Results are based on 2000 simulation runs and 2000 bootstrap samples for different quantile levels.
  • Figure 4: Empirical confidence level (along the horizontal axis) for different types of simulated CIs for quantile levels $5\%, 10\%, 25\%, 50\%, 75\%, 90\%, 95\%$ and the mean, for confidence levels $1-\alpha = 0.90, 0.95$ and sample sizes $n=10, 15, 25, 50$. Results are based on 2000 simulation runs and 2000 bootstrap samples.
  • Figure 5: Normalized average interval length (reference: interdecile range) for different distributions and different types of simulated CIs for quantile levels $5\%, 10\%, 25\%, 50\%, 75\%, 90\%, 95\%$ and the mean, for confidence levels $1-\alpha = 0.90, 0.95$ and sample sizes $n=10, 15, 25, 50$. Results are based on 2000 simulation runs and 2000 bootstrap samples.
  • ...and 5 more figures