Reliable and Efficient Amortized Model-based Evaluation

Sang Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo

TL;DR

The paper tackles the high cost and instability of evaluating language models across large benchmarks by introducing a model-based evaluation framework grounded in Item Response Theory (IRT). It couples amortized calibration with a content-aware difficulty predictor and a conditional question generator to build scalable, adaptive evaluation pipelines that deconfound model ability from question difficulty. Across 22 NLP benchmarks and 172 LLMs, the approach demonstrates improved reliability and cost-efficiency, with substantial reductions in calibration and querying requirements and strong generalization to new datasets and models. The work advances practical, iterative model evaluation and lays groundwork for broader AI assessment using calibrated, difficulty-aware question banks and adaptive testing strategies.

Abstract

Comprehensive evaluations of language models (LMs) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnosis) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice. Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluations impractical. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach, unfortunately, often yields an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the benchmark subset. Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by carefully controlling for question difficulty. Unfortunately, question difficulty is expensive to estimate. Facing this challenge, we train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost. In addition, we leverage this difficulty predictor to further improve evaluation efficiency by training a question generator conditioned on a difficulty level. This question generator is essential in adaptive testing, where, instead of using a random subset of the benchmark questions, informative questions are adaptively chosen based on the current estimate of LM performance. Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient than current common practice.
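
The confounding between subset average score and question difficulty can be illustrated with a minimal sketch. It assumes the standard Rasch form, in which the probability of a correct response is the logistic function of ability minus difficulty; the paper's exact parameterization may differ. The function names, the difficulty values, and the use of expected (fractional) responses in place of sampled 0/1 outcomes are illustrative assumptions, chosen to keep the example deterministic.

```python
import math


def rasch_prob(theta, z):
    """Probability of a correct response under an assumed Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - z)))


def expected_subset_score(theta, difficulties):
    """Expected average score of a test taker on a subset of questions."""
    return sum(rasch_prob(theta, z) for z in difficulties) / len(difficulties)


def estimate_ability(responses, difficulties, steps=50):
    """Maximum-likelihood ability estimate via Newton's method.

    Difficulties are treated as known (already calibrated).
    Responses may be 0/1 outcomes or, as in this sketch, expected values.
    """
    theta = 0.0
    for _ in range(steps):
        probs = [rasch_prob(theta, z) for z in difficulties]
        # Gradient and second derivative of the Bernoulli log-likelihood.
        grad = sum(y - p for y, p in zip(responses, probs))
        hess = -sum(p * (1.0 - p) for p in probs)
        theta -= grad / hess
    return theta


# Hypothetical difficulty values for two subsets of the same benchmark.
easy_subset = [-2.0, -1.5, -1.0]
hard_subset = [1.0, 1.5, 2.0]
true_ability = 1.0

for name, subset in (("easy", easy_subset), ("hard", hard_subset)):
    expected_responses = [rasch_prob(true_ability, z) for z in subset]
    avg = expected_subset_score(true_ability, subset)
    theta_hat = estimate_ability(expected_responses, subset)
    print(f"{name}: average score = {avg:.3f}, estimated ability = {theta_hat:.3f}")
```

Under these assumptions, the same test taker receives a much higher average on the easy subset than on the hard one, while the ability estimate, which accounts for difficulty, is the same for both.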

Paper Structure

This paper contains 14 sections, 10 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Method overview: The response matrix $Y$ records the responses of test takers (e.g., generative models) on current benchmark questions, with blue, red, and white cells indicating correct, incorrect, or missing responses (Subfigure a). The test taker's ability $\theta_i$ and question difficulty $z_j$ determine the probability of a correct response (Subfigure b). Calibration estimates question difficulty $\widehat{z_j}$ for adaptive testing, improving evaluation efficiency for new test takers ("new models" and "current datasets" in Subfigure a). Parameters $\phi, \psi, \omega$ govern the question difficulty predictor, conditional question generator, and featurizer (Subfigures b, c). The question difficulty model predicts $\widehat{z_{\text{new}}}$ to reduce calibration costs, while the conditional question generator creates questions targeting a specific difficulty level to expand the question bank.
  • Figure 2: AUC on the test set of different response models.
  • Figure 3: IRT consistently outperforms subset average score in AUC across datasets. Subset average scores are more sensitive to sample selection, while IRT estimates demonstrate greater generalizability and robustness.
  • Figure 4: Comparing amortized and traditional calibration on model fit and ability estimation quality. Each blue and red dot represents a dataset's train and test split. The x- and y-axes show metric values from amortized and traditional calibration, respectively. The comparable AUC across both methods indicates that the amortized Rasch model fits as well as the traditional approach, with comparable ability estimation quality, confirming the effectiveness of amortization.
  • Figure 5: Adaptive testing improves sample complexity on AIRBench. Fisher-large and Fisher-small are adaptive testing experiments based on a large (5236 questions) and a small (50 questions) question bank, respectively. Random selection uses the large question bank. With a budget of 50 questions, only the Fisher-large strategy can reach the measurement target.
  • ...and 6 more figures
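
The Fisher-based question selection named in Figure 5 can be sketched as follows, assuming a Rasch response model, for which the Fisher information of a question at ability theta is p(1 - p). The function names and the toy question bank are hypothetical, and the paper's actual selection procedure may differ in detail.

```python
import math


def rasch_prob(theta, z):
    """Probability of a correct response under an assumed Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - z)))


def fisher_information(theta, z):
    """Fisher information about theta carried by a question of difficulty z."""
    p = rasch_prob(theta, z)
    return p * (1.0 - p)


def select_next_question(theta_hat, bank, asked):
    """Return the index of the unasked question that is most informative
    at the current ability estimate, or None if the bank is exhausted."""
    candidates = [j for j in range(len(bank)) if j not in asked]
    if not candidates:
        return None
    return max(candidates, key=lambda j: fisher_information(theta_hat, bank[j]))


# Hypothetical calibrated difficulties for a tiny question bank.
bank = [-2.0, -0.5, 0.4, 1.1, 3.0]
first = select_next_question(1.0, bank, asked=set())
second = select_next_question(1.0, bank, asked={first})
print(first, second)
```

Under the Rasch assumption, information peaks where difficulty equals ability, so the rule picks the unasked question whose difficulty is closest to the current estimate. A small bank offers few questions near any given ability level, which is consistent with the caption's observation that only the large-bank strategy reaches the measurement target within the budget.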