
On Speeding Up Language Model Evaluation

Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, Kilian Q. Weinberger

TL;DR

This work tackles the costly task of evaluating prompt and hyperparameter choices for large language models by casting it as a budgeted best-arm identification problem. It introduces two adaptive algorithms, UCB-E and UCB-E-LRF: the former uses upper confidence bounds to select the next method-sample pair, and the latter adds a low-rank factorization that exploits correlations across methods and samples for score imputation and uncertainty estimation. Empirical results on six benchmarks show that both algorithms outperform baselines and identify the best method using only 5-15% of the full evaluation budget, and a complementary analysis clarifies when each algorithm excels. The work provides practical guidance for efficient LLM evaluation and software for reproducing results, with potential impact on rapid iteration in prompt engineering and hyperparameter tuning.

Abstract

Developing prompt-based methods with Large Language Models (LLMs) requires making numerous decisions, which give rise to a combinatorial search problem over hyper-parameters. Exhaustively evaluating every combination can be time-consuming and costly. In this paper, we propose an $\textit{adaptive}$ approach to explore this space. We exploit the fact that often only a few samples are needed to identify clearly superior or inferior settings, and that many evaluation tests are highly correlated. We lean on multi-armed bandits to sequentially identify the next (method, validation sample) pair to evaluate and utilize low-rank matrix factorization to fill in missing evaluations. We carefully assess the efficacy of our approach on several competitive benchmark problems and show that it can identify the top-performing method using only 5-15% of the typical resources, resulting in 85-95% savings in LLM cost. Our code is available at https://github.com/kilian-group/banditeval.

Paper Structure (47 sections, 1 theorem, 7 equations, 7 figures, 4 tables, 6 algorithms)

This paper contains 47 sections, 1 theorem, 7 equations, 7 figures, 4 tables, 6 algorithms.

Key Result

Corollary 1

Define $H_1 = \sum_{i=1, i\neq i^*}^m \frac{1}{(\mu_i-\mu_{i^*})^2}$ and suppose $a=\frac{25}{36}\frac{T-m}{H_1}$. Then $\mathbb{P}_{\mathcal{A}_{\rm ue}}(\mathcal{A}_{\rm ue}(T, \mathcal{F}, \mathcal{X}; a)=i^*)\geq 1-2 Tm\exp\left(-\frac{T-m}{18H_1} \right)$.
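
To make the quantities in Corollary 1 concrete, the following minimal sketch computes $H_1$, the exploration parameter $a$, and the lower bound on the success probability from a list of true mean scores. The function names `h1` and `ucb_e_success_bound` and the example means are illustrative assumptions, not part of the paper's code, and the sketch assumes the best method is unique (otherwise $H_1$ is undefined).

```python
import math


def h1(means):
    """Hardness H_1: sum of inverse squared gaps to the best mean.

    Assumes a unique best method; a tie would cause a division by zero.
    """
    i_star = max(range(len(means)), key=means.__getitem__)
    best = means[i_star]
    return sum(
        1.0 / (mu - best) ** 2 for i, mu in enumerate(means) if i != i_star
    )


def ucb_e_success_bound(means, T):
    """Return (H_1, a, lower bound) for m = len(means) methods and budget T."""
    m = len(means)
    H1 = h1(means)
    # Exploration parameter as chosen in Corollary 1.
    a = (25.0 / 36.0) * (T - m) / H1
    # Lower bound on P(algorithm returns i*). It is not clipped, so it can be
    # negative (vacuous) when T is small relative to H_1.
    bound = 1.0 - 2.0 * T * m * math.exp(-(T - m) / (18.0 * H1))
    return H1, a, bound


if __name__ == "__main__":
    # Hypothetical mean scores for three methods.
    print(ucb_e_success_bound([0.9, 0.5, 0.4], T=5000))
```

Because the bound is not clipped to $[0, 1]$, it only becomes informative once $T - m$ is large relative to $18 H_1$; for small budgets it is negative and says nothing.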

Figures (7)

  • Figure 1: Cost comparison for finding the best model or prompt between our proposed algorithms (UCB-E, UCB-E-LRF) and Full Evaluation on two datasets. The overhead computation time for UCB-E and UCB-E-LRF is 2.4 and 142.6 seconds, respectively. See text for details.
  • Figure 2: Active method-example pair selection: After the LLM has evaluated $t$ method-example pairs, we call Algorithm $\mathcal{A}$ to select the next method-example pair. We then query the LLM to evaluate this pair and fill the score it returns into the scoring matrix. Algorithm $\mathcal{A}$ then updates its internal state in preparation for the next method-example pair selection. This process is repeated $T$ times and, in the end, Algorithm $\mathcal{A}$ predicts the best method $f_{\hat{i}^*}$.
  • Figure 3: Comparison of algorithms on six datasets evaluated with various metrics. The vertical axis of each plot shows the value of a performance metric, and the horizontal axis shows the percentage of method-example pairs evaluated. All results are aggregated over 50 trials with different random seeds. The datasets are ordered by decreasing $H_1$ (see the section on UCB-E). Larger $H_1$ values indicate settings where finding the best method is more difficult. Our proposed algorithms, UCB-E and UCB-E-LRF, consistently require far fewer evaluations to achieve the same performance as the baselines.
  • Figure 4: Ablations of the proposed algorithms and hyperparameters. UCB-E and UCB-E-LRF are denoted by blue and red lines, respectively, consistent with the main results figure.
  • Figure 5: Bar plot of model performance on all examples in all six datasets. Examples are ordered by their rank. Each dataset's $H_1$ value is displayed next to its name. Datasets with a large gap between the best and second-best method, such as AlpacaEval, have small $H_1$ values. The $H_1$ value indicates the difficulty of identifying the best method, with smaller values indicating easier identification.
  • ...and 2 more figures
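
The selection loop described in the Figure 2 caption can be sketched roughly as follows for the UCB-E case. This is a simplified illustration under stated assumptions, not the authors' implementation: `score_fn(i, j)` is a hypothetical stand-in for querying the LLM with method `i` on example `j` and scoring the output, the bonus term $\sqrt{a / n_i}$ is the standard UCB-E form, examples are drawn uniformly from those not yet evaluated for the chosen method, and the low-rank factorization used by UCB-E-LRF is omitted.

```python
import math
import random


def ucb_e(score_fn, num_methods, num_examples, budget, a, seed=0):
    """Pick (method, example) pairs adaptively; return (best method, counts)."""
    rng = random.Random(seed)
    # Unevaluated example indices per method, shuffled so pop() is a random draw.
    remaining = []
    for _ in range(num_methods):
        idx = list(range(num_examples))
        rng.shuffle(idx)
        remaining.append(idx)
    counts = [0] * num_methods  # evaluations spent on each method
    sums = [0.0] * num_methods  # running sum of observed scores

    def upper_bound(i):
        if not remaining[i]:
            return -math.inf  # every example already evaluated for method i
        if counts[i] == 0:
            return math.inf  # force one evaluation of each method first
        return sums[i] / counts[i] + math.sqrt(a / counts[i])

    # Never spend more than the total number of distinct pairs.
    for _ in range(min(budget, num_methods * num_examples)):
        i = max(range(num_methods), key=upper_bound)  # most promising method
        j = remaining[i].pop()  # random unevaluated example for that method
        sums[i] += score_fn(i, j)  # stand-in for the LLM query and scoring
        counts[i] += 1

    means = [s / c if c else -math.inf for s, c in zip(sums, counts)]
    best = max(range(num_methods), key=means.__getitem__)
    return best, counts


if __name__ == "__main__":
    # Hypothetical noiseless scores: method 1 is the best.
    true_means = [0.2, 0.8, 0.5]
    print(ucb_e(lambda i, j: true_means[i], 3, 100, budget=60, a=1.0))
```

In this sketch most of the budget flows to the methods whose upper bounds stay high, so clearly inferior methods receive only a handful of evaluations.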

Theorems & Definitions (1)

  • Corollary 1: Lower bound on success probability of UCB-E