Tuning LLM Judge Design Decisions for 1/1000 of the Cost

David Salinas, Omar Swelam, Frank Hutter

TL;DR

This work addresses the high cost of evaluating instruction-tuned models by tuning LLM-based judges through a systematic, multi-fidelity, multi-objective search. By exploring a large configuration space of base models, prompts, and inference settings, the authors identify judge configurations that improve accuracy while reducing cost, and demonstrate open-weight models that match or exceed prior benchmarks on LMSys, PandaLM, and Arena-Hard datasets. Key contributions include scaling analyses showing limitations of mere model size, a comprehensive tuning framework across 4480 configurations, and a Pareto-based optimization approach that balances performance and expense. The results promote cost-effective, reproducible evaluation pipelines and stimulate community adoption of open-weight judges, while also acknowledging biases and proposing future improvements in stability and fairness.

Abstract

Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed; they compare the outputs of two LLMs, which enables ranking models without human intervention. While several approaches have been proposed, many confounding factors differ between papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective, multi-fidelity optimization, which makes it possible to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also use open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at https://github.com/geoalgo/judgetuning.
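
The multi-objective, multi-fidelity idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes each judge configuration is scored as a (cost, human agreement) pair, ranks configurations by non-dominated sorting, and re-evaluates only the best-ranked survivors on more instructions. The function names, the tie-breaking rule, the toy evaluator, and the stage sizes are all hypothetical choices made for illustration.

```python
import random


def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one.

    Points are (cost, agreement): lower cost and higher agreement are better.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def non_dominated_ranks(points):
    """Rank 0 is the Pareto front, rank 1 the front once rank 0 is removed, etc.

    Simple peeling implementation; fine for small illustrative inputs.
    """
    ranks = {}
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return [ranks[i] for i in range(len(points))]


def select_top(points, k):
    """Indices of the k best points by rank; ties broken by higher agreement
    (the tie-break is an assumption of this sketch)."""
    ranks = non_dominated_ranks(points)
    order = sorted(range(len(points)), key=lambda i: (ranks[i], -points[i][1]))
    return order[:k]


def multi_fidelity_search(configs, evaluate, schedule):
    """schedule is a list of (n_instructions, n_keep) stages, cheapest first."""
    survivors = list(configs)
    for n_instructions, n_keep in schedule:
        scores = [evaluate(c, n_instructions) for c in survivors]
        survivors = [survivors[i] for i in select_top(scores, n_keep)]
    return survivors


def make_toy_evaluator(seed=0):
    """Hypothetical evaluator: the agreement estimate gets less noisy as more
    instructions are used. A config is a (cost, true_agreement) pair."""
    rng = random.Random(seed)

    def evaluate(config, n_instructions):
        cost, true_agreement = config
        noise = rng.gauss(0.0, 0.5 / n_instructions ** 0.5)
        return (cost, true_agreement + noise)

    return evaluate


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy configurations, not data from the paper.
    configs = [(rng.uniform(0.1, 10.0), rng.uniform(0.5, 0.8)) for _ in range(60)]
    final = multi_fidelity_search(configs, make_toy_evaluator(), [(50, 20), (200, 5)])
    for cost, agreement in sorted(final):
        print(f"cost={cost:.2f} true_agreement={agreement:.3f}")
```

The saving comes from the schedule: most configurations are only ever scored at the cheapest fidelity, and the expensive evaluations are spent on the few that remain competitive on both objectives.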

Paper Structure

This paper contains 49 sections, 6 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Effect of scaling the LLM judge and increasing the number of instructions on Spearman correlation. In contrast to human agreement, neither Alpaca-Eval, Arena-Hard, nor their union distinguishes the quality difference between 32B and 72B models.
  • Figure 2: Effect of scaling the LLM judge and the number of instructions on human agreement.
  • Figure 3: Illustration of the prompt templating approach. We parametrize the prompt with the following hyperparameters: Provide answer, Provide explanation, Provide example, use JSON, output preference format. For each of the $2^4\times5=80$ prompt hyperparameter configurations, we generate a prompt like this one.
  • Figure 4: Illustration of the selection process. All 4480 configurations are first evaluated on 400 instructions (left); the top 1200 configurations are then evaluated on 1200 instructions (center); finally, the top 400 configurations are evaluated on 3548 instructions (right). The color denotes the ranking assigned by the non-dominated sort procedure.
  • Figure 5: We plot the cost per annotation and human agreement of all 4480 judges when using 400 instructions. The model family and the number of parameters are represented with color and size respectively.
  • ...and 8 more figures
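
The count of $2^4\times5=80$ prompt variants in the Figure 3 caption can be checked with a short enumeration. The sketch below assumes the first four hyperparameters are on/off switches and that the output preference format takes five values, which is how the caption's formula reads. The option names, the five format labels, and the wording produced by `render_prompt` are hypothetical placeholders, not the paper's actual template.

```python
from itertools import product

# Four binary switches named in the Figure 3 caption (identifier spelling is ours).
BINARY_OPTIONS = ["provide_answer", "provide_explanation", "provide_example", "use_json"]

# The caption implies five output preference formats; their real names are not
# given in this listing, so these labels are placeholders.
PREFERENCE_FORMATS = ["format_a", "format_b", "format_c", "format_d", "format_e"]


def enumerate_prompt_configs():
    """Return every combination of the four switches and the five formats."""
    configs = []
    for flags in product([False, True], repeat=len(BINARY_OPTIONS)):
        for fmt in PREFERENCE_FORMATS:
            config = dict(zip(BINARY_OPTIONS, flags))
            config["preference_format"] = fmt
            configs.append(config)
    return configs


def render_prompt(config):
    """Hypothetical template: each switch adds or changes one instruction line."""
    parts = ["Compare the two assistant answers to the user instruction."]
    if config["provide_answer"]:
        parts.append("First write your own answer to the instruction.")
    if config["provide_explanation"]:
        parts.append("Explain your reasoning before giving the verdict.")
    if config["provide_example"]:
        parts.append("An example of a well-formed judgement is included below.")
    output_kind = "JSON" if config["use_json"] else "plain text"
    fmt = config["preference_format"]
    parts.append(f"Report your preference as {output_kind} using the '{fmt}' format.")
    return "\n".join(parts)


if __name__ == "__main__":
    configs = enumerate_prompt_configs()
    print(len(configs))  # number of prompt hyperparameter configurations
    print(render_prompt(configs[-1]))
```

Treating the prompt as a small product space like this is what makes it searchable alongside the base model and inference settings: every prompt is identified by its hyperparameter values rather than by free-form text.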