Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks
Rachel Longjohn, Giri Gopalan, Emily Casleton
TL;DR
This work addresses the challenge of quantifying uncertainty when aggregating per-task performance across benchmarks for pretrained models. It combines bootstrapping and beta-binomial Bayesian hierarchical modeling to provide interval estimates for aggregated metrics and ranks, and introduces uncertainty-aware visualizations of task weighting. Applying these methods to VTAB reveals that dominance among models can depend on task weighting and normalization, and that some models are statistically indistinguishable once uncertainty is accounted for. The approach offers a practical blueprint for robust benchmarking of foundation models with explicit uncertainty quantification, aiding informed model selection and interpretation in real-world applications.
Abstract
Modern artificial intelligence is supported by machine learning models (e.g., foundation models) that are pretrained on a massive data corpus and then adapted to solve a variety of downstream tasks. To summarize performance across multiple tasks, evaluation metrics are often aggregated into a single summary metric, e.g., average accuracy across 10 question-answering tasks. When aggregating evaluation metrics, it is useful to quantify the uncertainty in the aggregate metric in order to gain a more realistic understanding of model performance. Our objective in this work is to demonstrate how statistical methodology can be used to quantify uncertainty in metrics that have been aggregated across multiple tasks. The methods we emphasize are bootstrapping, Bayesian hierarchical (i.e., multilevel) modeling, and visualizations of task weightings that account for standard errors. These techniques reveal insights such as the dominance of a specific model on certain types of tasks despite poor overall performance. We use a popular ML benchmark, the Visual Task Adaptation Benchmark (VTAB), to demonstrate the usefulness of our approaches.
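To make the bootstrapping idea concrete, the following is a minimal sketch of a percentile bootstrap interval for an unweighted mean accuracy across tasks. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `bootstrap_mean_accuracy`, the resampling of test examples independently within each task, the equal task weights, and the synthetic 0/1 outcomes are all assumptions introduced here.

```python
import random


def bootstrap_mean_accuracy(per_task_correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the unweighted mean accuracy across tasks.

    per_task_correct: list of lists of 0/1 outcomes, one inner list per task.
    Test examples are resampled with replacement within each task (an assumption
    of this sketch), and tasks are weighted equally in the aggregate.
    """
    rng = random.Random(seed)

    def accuracy(outcomes):
        return sum(outcomes) / len(outcomes)

    # Point estimate: mean of the per-task accuracies.
    point = sum(accuracy(t) for t in per_task_correct) / len(per_task_correct)

    replicates = []
    for _ in range(n_boot):
        # Resample each task's test set, then recompute the aggregate metric.
        task_accs = [accuracy(rng.choices(t, k=len(t))) for t in per_task_correct]
        replicates.append(sum(task_accs) / len(task_accs))

    # Percentile interval from the sorted bootstrap replicates.
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


# Synthetic per-example outcomes for three hypothetical tasks (not VTAB data).
data_rng = random.Random(1)
tasks = [
    [1 if data_rng.random() < p else 0 for _ in range(n)]
    for p, n in [(0.9, 200), (0.6, 500), (0.75, 100)]
]
point, lower, upper = bootstrap_mean_accuracy(tasks)
print(f"mean accuracy = {point:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

Comparing such intervals for two models gives a first indication of whether a difference in aggregate accuracy is distinguishable from resampling noise. Changing the task weights inside the aggregate would change both the point estimate and the interval.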
