Cost-Efficient Estimation of General Abilities Across Benchmarks

Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner

Abstract

Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
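
To make the pipeline described above concrete, the sketch below implements a deliberately simplified version of its two core ingredients: a logistic item response model linking a latent ability to the probability of answering an item correctly, and adaptive item selection that greedily maximizes Fisher information per token as a stand-in for the cost-aware optimal-experimental-design criterion. This is not the paper's nested MIRT model or its exact selection objective; the scalar ability, the item parameters, and the per-item token costs are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation):
# a one-dimensional 2PL item response model plus adaptive, cost-discounted
# item selection based on Fisher information per token.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item bank: discrimination a_i, difficulty b_i, token cost c_i.
n_items = 200
a = rng.uniform(0.5, 2.0, n_items)        # discrimination
b = rng.normal(0.0, 1.0, n_items)         # difficulty
cost = rng.integers(100, 3000, n_items)   # assumed tokens consumed per item

true_theta = 0.8                          # latent ability of the evaluated model


def p_correct(theta, a_i, b_i):
    """2PL response probability: sigma(a_i * (theta - b_i))."""
    return 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))


def fisher_info(theta, a_i, b_i):
    """Fisher information of a 2PL item at ability theta: a_i^2 * p * (1 - p)."""
    p = p_correct(theta, a_i, b_i)
    return a_i ** 2 * p * (1.0 - p)


def estimate_theta(asked, responses):
    """Maximum-likelihood ability estimate over a coarse grid (for simplicity)."""
    grid = np.linspace(-4, 4, 401)
    ll = np.zeros_like(grid)
    for i, y in zip(asked, responses):
        p = p_correct(grid, a[i], b[i])
        ll += y * np.log(p) + (1 - y) * np.log(1 - p)
    return grid[np.argmax(ll)]


# Adaptive loop: repeatedly ask the unseen item with the highest
# information-per-token at the current ability estimate.
theta_hat, asked, responses, tokens_used = 0.0, [], [], 0
for _ in range(16):
    remaining = [i for i in range(n_items) if i not in asked]
    gain = fisher_info(theta_hat, a[remaining], b[remaining]) / cost[remaining]
    i = remaining[int(np.argmax(gain))]

    # Simulate the evaluated model's (noisy) response to the chosen item.
    y = int(rng.random() < p_correct(true_theta, a[i], b[i]))
    asked.append(i)
    responses.append(y)
    tokens_used += cost[i]
    theta_hat = estimate_theta(asked, responses)

print(f"estimated ability {theta_hat:+.2f} after {len(asked)} items and {tokens_used} tokens")
```

In this simplified setting, dividing the information criterion by the item's token cost plays the role of the cost-aware discount factor: cheap, informative items are preferred, which is how the evaluation budget needed to reach a target error can shrink.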

Figures (3)

  • Figure 1: Left: Token usage and accuracy averaged across all LLMs for the tasks in WILD. Tasks vary in token usage, leading to radically different item costs. Right: MAE for predicting the sample mean over 112 held-out tasks. The MIRT model with Optimal Experimental Design selection achieves low MAE at small sample sizes and token costs.
  • Figure 2: Left: Baseline selectors vs. Adaptive V-Optimal. Right: Prediction comparison: MIRT vs. baseline predictors as we increase the number of extrapolation tasks. Model Task Mean (MTM) baselines are constant because extrapolation tasks are fully held out for the test models. (EB) refers to empirical Bayes. See \ref{fig:calib_valid_full} for interpolation results.
  • Figure 3: Summary of main results comparing the performance of the Nested MIRT and 2PL IRT models. (Left) Comparison in the static selection setting. (Center) Comparison in the adaptive selection setting. (Right) Comparison of token efficiency in the adaptive setting. In all three plots, Random selection corresponds to the Prediction setting (\ref{sec:eval_paradigms}). (The standard 2PL response form is recalled after this list for reference.)
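
For readers unfamiliar with the baseline compared in Figure 3, the standard two-parameter logistic (2PL) IRT model gives the probability that model $j$ answers item $i$ correctly as a function of a scalar latent ability (the exact parameterization of the paper's nested MIRT variant is not reproduced here):

$$
P(y_{ij} = 1 \mid \theta_j) \;=\; \sigma\big(a_i(\theta_j - b_i)\big) \;=\; \frac{1}{1 + \exp\big(-a_i(\theta_j - b_i)\big)},
$$

where $\theta_j$ is the ability of model $j$, and $a_i$ and $b_i$ are the discrimination and difficulty of item $i$. Standard compensatory multidimensional IRT replaces the scalar term $a_i(\theta_j - b_i)$ with $\mathbf{a}_i^\top \boldsymbol{\theta}_j - b_i$, so that performance is explained by a small vector of latent abilities rather than a single one.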