Tuning LLM Judge Design Decisions for 1/1000 of the Cost

David Salinas; Omar Swelam; Frank Hutter

TL;DR

This work addresses the high cost of evaluating instruction-tuned models by tuning LLM-based judges through a systematic, multi-fidelity, multi-objective search. By exploring a large configuration space of base models, prompts, and inference settings, the authors identify judge configurations that improve accuracy while reducing cost, and demonstrate open-weight models that match or exceed prior benchmarks on LMSys, PandaLM, and Arena-Hard datasets. Key contributions include scaling analyses showing limitations of mere model size, a comprehensive tuning framework across 4480 configurations, and a Pareto-based optimization approach that balances performance and expense. The results promote cost-effective, reproducible evaluation pipelines and stimulate community adoption of open-weight judges, while also acknowledging biases and proposing future improvements in stability and fairness.

Abstract

Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective, multi-fidelity optimization, which finds judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at https://github.com/geoalgo/judgetuning.
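The accuracy-versus-cost trade-off mentioned above can be illustrated with a minimal sketch of Pareto-front selection. This is not the authors' implementation: the `JudgeResult` type, the `pareto_front` helper, the configuration names, and all numbers are hypothetical and chosen only to show the idea that a judge configuration is kept when no other configuration is at least as accurate and at least as cheap while being strictly better on one of the two.

```python
from typing import List, NamedTuple


class JudgeResult(NamedTuple):
    """Hypothetical record for one evaluated judge configuration."""
    name: str
    accuracy: float  # higher is better (e.g. agreement with human preferences)
    cost: float      # lower is better (e.g. arbitrary cost units per evaluation)


def pareto_front(results: List[JudgeResult]) -> List[JudgeResult]:
    """Return the configurations that no other configuration dominates."""
    front = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            front.append(r)
    # Sort by cost so the front reads from cheapest to most expensive.
    return sorted(front, key=lambda r: r.cost)


# Made-up values for illustration only; they are not results from the paper.
candidates = [
    JudgeResult("small-model/short-prompt", 0.62, 1.0),
    JudgeResult("small-model/long-prompt", 0.60, 2.0),
    JudgeResult("large-model/short-prompt", 0.70, 8.0),
    JudgeResult("large-model/long-prompt", 0.69, 12.0),
]

front = pareto_front(candidates)
for r in front:
    print(f"{r.name}: accuracy={r.accuracy:.2f}, cost={r.cost:.1f}")
```

In this toy example the two long-prompt configurations are dominated (each is both less accurate and more expensive than its short-prompt counterpart), so only the two short-prompt configurations remain on the front. A practitioner would then pick a point on the front according to their budget.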
