Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks

Rachel Longjohn; Giri Gopalan; Emily Casleton

TL;DR

This work addresses the challenge of quantifying uncertainty when aggregating per-task performance across benchmarks for pretrained models. It combines bootstrapping and beta-binomial Bayesian hierarchical modeling to provide interval estimates for aggregated metrics and ranks, and introduces uncertainty-aware visualizations of task weighting. Applying these methods to VTAB reveals that dominance among models can depend on task weighting and normalization, and that some models are statistically indistinguishable once uncertainty is accounted for. The approach offers a practical blueprint for robust benchmarking of foundation models with explicit uncertainty quantification, aiding informed model selection and interpretation in real-world applications.

Abstract

Modern artificial intelligence is supported by machine learning models (e.g., foundation models) that are pretrained on a massive data corpus and then adapted to solve a variety of downstream tasks. To summarize performance across multiple tasks, evaluation metrics are often aggregated into a summary metric, e.g., average accuracy across 10 question-answering tasks. When aggregating evaluation metrics, it is useful to incorporate uncertainty in the aggregate metric in order to gain a more realistic understanding of model performance. Our objective in this work is to demonstrate how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks. The methods we emphasize are bootstrapping, Bayesian hierarchical (i.e., multilevel) modeling, and the visualization of task weightings that consider standard errors. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite an overall poor performance. We use a popular ML benchmark, the Visual Task Adaptation Benchmark (VTAB), to demonstrate the usefulness of our approaches.
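As a rough illustration of the bootstrapping idea mentioned in the abstract, the following minimal sketch computes a percentile bootstrap interval for a weighted average of per-task accuracies. It is not the authors' implementation: the function name `bootstrap_aggregate_ci`, the input format (one 0/1 correctness array per task), the choice to resample test examples independently within each task, and the simulated data are all assumptions made for this example.

```python
import numpy as np


def bootstrap_aggregate_ci(correct_by_task, weights=None, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a weighted mean of per-task accuracies.

    correct_by_task: list of 1-D 0/1 arrays, one per task (hypothetical input format).
    weights: per-task weights summing to 1; defaults to equal weights.
    """
    rng = np.random.default_rng(seed)
    k = len(correct_by_task)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)

    # Point estimate: weighted mean of the observed per-task accuracies.
    point = float(np.dot(w, [np.mean(c) for c in correct_by_task]))

    boot = np.empty(n_boot)
    for b in range(n_boot):
        # Assumption: resample test examples with replacement within each task,
        # then recompute the aggregate on the resampled data.
        accs = [rng.choice(c, size=len(c), replace=True).mean() for c in correct_by_task]
        boot[b] = np.dot(w, accs)

    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return point, float(lo), float(hi)


# Simulated (not real) correctness indicators for three tasks of different sizes.
sim_rng = np.random.default_rng(1)
tasks = [sim_rng.binomial(1, p, size=n) for p, n in [(0.9, 500), (0.7, 200), (0.5, 1000)]]
point, lo, hi = bootstrap_aggregate_ci(tasks)
print(f"aggregate accuracy = {point:.3f}, 95% interval = ({lo:.3f}, {hi:.3f})")
```

Passing a different `weights` vector (for example, one that puts most of the mass on a single task category) shows how both the point estimate and the interval width change with the task weighting.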

Paper Structure (19 sections, 10 equations, 6 figures, 11 tables)

Introduction
Why VTAB?
Related Work
Issues in Evaluating Pretrained Models and Contributions
Statistical Methods
Bootstrapping Evaluation Data
Bayesian Hierarchical Modeling of Evaluation Data
Simulation study
Weighting Task Performances
Metric Normalization
Application to VTAB
Confidence and Credibility Intervals for Aggregated Performances
Confidence and Credible Intervals for Aggregated Ranks
Visualizing Task Weightings with Uncertainty
Discussion and Conclusion
...and 4 more sections

Figures (6)

Figure 1: Illustration of challenges when evaluating foundation models.
Figure 2: Illustration of bootstrap procedure for aggregate metrics.
Figure 3: Visualization of differences in model performance under different category weightings using the unnormalized accuracies.
Figure 4: Visualization of differences in model performance under different category weightings using the normalized accuracies.
Figure 5: Posterior probabilities over model ranks for average accuracy weighted to favor structured image tasks, using the Bayesian hierarchical model from the section "Bayesian Hierarchical Modeling of Evaluation Data" ($w_{Str} = 0.95$, $w_{Nat} = w_{Spe} = 0.025$).
...and 1 more figure
