Cost-Efficient Estimation of General Abilities Across Benchmarks

Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner

Abstract

Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
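
To make the pipeline described above concrete, the sketch below implements a deliberately simplified version of its two core ingredients: a logistic item response model linking a latent ability to the probability of answering an item correctly, and adaptive item selection that greedily maximizes Fisher information per token as a stand-in for the cost-aware optimal-experimental-design criterion. This is not the paper's nested MIRT model or its exact selection objective; the scalar ability, the item parameters, and the per-item token costs are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation):
# a one-dimensional 2PL item response model plus adaptive, cost-discounted
# item selection based on Fisher information per token.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item bank: discrimination a_i, difficulty b_i, token cost c_i.
n_items = 200
a = rng.uniform(0.5, 2.0, n_items)        # discrimination
b = rng.normal(0.0, 1.0, n_items)         # difficulty
cost = rng.integers(100, 3000, n_items)   # assumed tokens consumed per item

true_theta = 0.8                          # latent ability of the evaluated model


def p_correct(theta, a_i, b_i):
    """2PL response probability: sigma(a_i * (theta - b_i))."""
    return 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))


def fisher_info(theta, a_i, b_i):
    """Fisher information of a 2PL item at ability theta: a_i^2 * p * (1 - p)."""
    p = p_correct(theta, a_i, b_i)
    return a_i ** 2 * p * (1.0 - p)


def estimate_theta(asked, responses):
    """Maximum-likelihood ability estimate over a coarse grid (for simplicity)."""
    grid = np.linspace(-4, 4, 401)
    ll = np.zeros_like(grid)
    for i, y in zip(asked, responses):
        p = p_correct(grid, a[i], b[i])
        ll += y * np.log(p) + (1 - y) * np.log(1 - p)
    return grid[np.argmax(ll)]


# Adaptive loop: repeatedly ask the unseen item with the highest
# information-per-token at the current ability estimate.
theta_hat, asked, responses, tokens_used = 0.0, [], [], 0
for _ in range(16):
    remaining = [i for i in range(n_items) if i not in asked]
    gain = fisher_info(theta_hat, a[remaining], b[remaining]) / cost[remaining]
    i = remaining[int(np.argmax(gain))]

    # Simulate the evaluated model's (noisy) response to the chosen item.
    y = int(rng.random() < p_correct(true_theta, a[i], b[i]))
    asked.append(i)
    responses.append(y)
    tokens_used += cost[i]
    theta_hat = estimate_theta(asked, responses)

print(f"estimated ability {theta_hat:+.2f} after {len(asked)} items and {tokens_used} tokens")
```

In this simplified setting, dividing the information criterion by the item's token cost plays the role of the cost-aware discount factor: cheap, informative items are preferred, which is how the evaluation budget needed to reach a target error can shrink.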

Figures (3)

  • Figure 1: Left: Token usage and accuracy averaged across all LLMs for the tasks in WILD. Tasks vary in token usage, leading to radically different item costs. Right: MAE for predicting the sample mean over 112 held-out tasks. The MIRT model with Optimal Experimental Design selection achieves low MAE at small sample sizes and token costs.
  • Figure 2: Left: Baseline selectors vs. Adaptive V-Optimal. Right: Prediction comparison: MIRT vs. baseline predictors as we increase the number of extrapolation tasks. Model Task Mean (MTM) baselines are constant because extrapolation tasks are fully held out for the test models. (EB) refers to empirical Bayes. See \ref{fig:calib_valid_full} for interpolation results.
  • Figure 3: Summary of main results comparing the performance of the Nested MIRT and 2PL IRT models. (Left) Comparison in the static selection setting. (Center) Comparison in the adaptive selection setting. (Right) Comparison of token efficiency in the adaptive setting. In all three plots, Random selection corresponds to the Prediction setting (\ref{sec:eval_paradigms}). (The standard 2PL response form is recalled after this list for reference.)
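
For readers unfamiliar with the baseline compared in Figure 3, the standard two-parameter logistic (2PL) IRT model gives the probability that model $j$ answers item $i$ correctly as a function of a scalar latent ability (the exact parameterization of the paper's nested MIRT variant is not reproduced here):

$$
P(y_{ij} = 1 \mid \theta_j) \;=\; \sigma\big(a_i(\theta_j - b_i)\big) \;=\; \frac{1}{1 + \exp\big(-a_i(\theta_j - b_i)\big)},
$$

where $\theta_j$ is the ability of model $j$, and $a_i$ and $b_i$ are the discrimination and difficulty of item $i$. Standard compensatory multidimensional IRT replaces the scalar term $a_i(\theta_j - b_i)$ with $\mathbf{a}_i^\top \boldsymbol{\theta}_j - b_i$, so that performance is explained by a small vector of latent abilities rather than a single one.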