
Precise Model Benchmarking with Only a Few Observations

Riccardo Fogliato, Pratik Patil, Nil-Jana Akpinar, Mathew Monfort

TL;DR

This work prescribes a simple yet effective solution: an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately, improving the precision of subgroup-level estimates of model performance.

Abstract

How can we precisely estimate a large language model's (LLM) accuracy on questions belonging to a specific topic within a larger question-answering dataset? The standard direct estimator, which averages the model's accuracy on the questions in each subgroup, may exhibit high variance for subgroups (topics) with small sample sizes. Synthetic regression modeling, which leverages the model's accuracy on questions about other topics, may yield biased estimates that are too unreliable for large subgroups. We prescribe a simple yet effective solution: an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately, improving the precision of subgroup-level estimates of model performance. Our experiments on multiple datasets show that this approach consistently provides more precise estimates of the LLM performance compared to the direct and regression approaches, achieving substantial reductions in the mean squared error. Confidence intervals for EB estimates also have near-nominal coverage and are narrower compared to those for the direct estimator. Additional experiments on tabular and vision data validate the benefits of this EB approach.
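To make the idea of "balancing direct and regression estimates for each subgroup separately" concrete, here is a minimal sketch of a shrinkage estimator in the general empirical Bayes style. It is not the authors' implementation: the abstract does not specify the weighting formula. The method-of-moments variance estimate, the smoothed binomial variance, the intercept-only "regression" (a pooled mean), and all function and variable names are assumptions made for illustration.

```python
from statistics import mean


def direct_estimates(correct_by_group):
    """Per-subgroup mean accuracy and an estimate of its sampling variance."""
    out = {}
    for g, ys in correct_by_group.items():
        n = len(ys)
        p = sum(ys) / n
        # Assumption: smooth p before plugging it into the binomial variance,
        # so a tiny subgroup with all-correct answers does not get variance 0.
        p_s = (sum(ys) + 0.5) / (n + 1)
        out[g] = (p, p_s * (1 - p_s) / n)
    return out


def eb_shrink(direct, regression):
    """Blend direct and regression estimates with a per-subgroup weight."""
    groups = list(direct)
    # Assumption: method-of-moments estimate of the variance A of the true
    # accuracies around the regression predictions (floored at zero).
    resid_sq = mean((direct[g][0] - regression[g]) ** 2 for g in groups)
    A = max(resid_sq - mean(direct[g][1] for g in groups), 0.0)
    est, weights = {}, {}
    for g in groups:
        d, var = direct[g]
        # Weight on the regression estimate: large when the direct estimate
        # is noisy (small subgroup), small when it is already precise.
        b = var / (var + A) if var + A > 0 else 1.0
        weights[g] = b
        est[g] = (1 - b) * d + b * regression[g]
    return est, weights


# Toy data (made up for illustration): 1 = correct answer, 0 = incorrect.
data = {
    "small": [1, 1, 1, 0],
    "mid": [1] * 45 + [0] * 5,
    "large": [1] * 100 + [0] * 100,
}
direct = direct_estimates(data)

# Stand-in for a regression model: an intercept-only fit, i.e. pooled accuracy.
pooled = sum(sum(ys) for ys in data.values()) / sum(len(ys) for ys in data.values())
regression = {g: pooled for g in data}

eb, weights = eb_shrink(direct, regression)
for g in data:
    print(g, round(direct[g][0], 3), round(regression[g], 3), round(eb[g], 3))
```

In this sketch the smallest subgroup is pulled most strongly toward the regression estimate, while the largest subgroup stays close to its direct estimate, which is the qualitative behavior the abstract describes.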

Paper Structure (32 sections, 9 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Estimates of LLM accuracy and their 95% confidence intervals for predictions made by Gemma-2b across various subgroups on a subset of HellaSwag. The empirical Bayes estimates have precision similar to the direct estimator for some subgroups (e.g., clean and jerk) and higher precision for others (e.g., high jump). This approach also provides tighter confidence intervals for the estimates; e.g., see running and sports topics.
  • Figure 2: Comparison of methods for estimating LLM accuracy across datasets. The plot shows the ratios of the MSEs of the regression (SR) and empirical Bayes (EB) estimates to the MSE of the direct estimator (DT) for the LLMs' accuracies on LLM-task subgroups (in parentheses when pre-defined). Lower ratios indicate more accurate estimates relative to DT. EB provides more precise estimates than both SR and DT in most evaluations.
  • Figure 3: Comparison of subgroup MSEs across methods. The plot compares the MSEs for all subgroups (LLM-domain pairs) in four datasets. SR tends to perform better than DT on small subgroups but not always on larger ones. EB performs better than both on small and large subgroups alike. An MSE of $0.01$ corresponds to a root-mean-squared error of $0.1$, i.e., a typical deviation $|\widehat{\mu}_g-\mu_g|$ of about $0.1$.
  • Figure 4: Average coverage and width of 95% confidence intervals for DT and EB estimates of LLM accuracy across datasets. EB intervals maintain high coverage and are generally narrower than those of DT.
  • Figure 5: Comparison of methods across datasets for estimating CLIP's zero-shot accuracies on subgroups of classification tasks. See the full list of tasks in the vision experiments section. Each observation is the ratio of the average MSE of the SR or EB estimates to the average MSE of the DT estimates. EB yields more precise estimates than SR and DT across most datasets and models.
  • ...and 2 more figures
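
The figure captions above rely on three summary quantities: the ratio of a method's MSE to the direct estimator's MSE, the coverage of confidence intervals, and their average width. The following sketch shows one way these could be computed. The function names are hypothetical and the numbers are made up for illustration; they are not results from the paper.

```python
def mse(estimates, truths):
    """Mean squared error of subgroup estimates against true accuracies."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)


def mse_ratio(method_estimates, direct_estimates, truths):
    """MSE of a method divided by the MSE of the direct estimator (DT).

    Values below 1 mean the method is more accurate than DT.
    """
    return mse(method_estimates, truths) / mse(direct_estimates, truths)


def coverage_and_width(intervals, truths):
    """Fraction of intervals containing the truth, and the mean interval width."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return hits / len(truths), width


# Illustrative values only (not taken from the paper).
truths = [0.5, 0.8]
dt_est = [0.6, 0.7]
eb_est = [0.55, 0.75]
intervals = [(0.4, 0.8), (0.5, 0.75)]

ratio = mse_ratio(eb_est, dt_est, truths)
coverage, width = coverage_and_width(intervals, truths)
print(round(ratio, 3), coverage, round(width, 3))
```

In practice the true subgroup accuracies are unknown; an evaluation of this kind needs some reference value for each subgroup, for example the accuracy measured on a much larger held-out sample, which is an assumption of this sketch.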