RocketEval: Efficient Automated LLM Evaluation via Grading Checklist

Tianjun Wei, Wei Wen, Ruizhi Qiao, Xing Sun, Jianghong Ma

TL;DR

RocketEval reframes automated LLM evaluation as a multi-faceted Q&A task, using instance-specific checklists that are created by a strong LLM and graded by lightweight LLMs. By enforcing independent judgments for each checklist item and introducing a conditional normalized scoring scheme, it mitigates the uncertainty and positional bias of lightweight judges and supports both unsupervised and supervised score prediction, including an alpha-weighted blend with annotated data. Empirically, RocketEval achieves high alignment with human preferences (e.g., $r_s=0.965$ on WildBench with Gemma-2-2B as judge) while reducing cost by more than 50-fold in large-scale evaluation and comparison scenarios. The approach offers a scalable, reproducible, and interpretable pathway for large-scale LLM evaluation and comparison.

Abstract

Evaluating large language models (LLMs) in diverse and challenging scenarios is essential to align them with human preferences. To mitigate the prohibitive costs associated with human evaluations, utilizing a powerful LLM as a judge has emerged as a favored approach. Nevertheless, this methodology encounters several challenges, including substantial expenses, concerns regarding privacy and security, and limited reproducibility. In this paper, we propose RocketEval, a straightforward, replicable, and accurate automated evaluation method that leverages a lightweight LLM as the judge. We first identify that the performance disparity between lightweight and powerful LLMs in evaluation tasks primarily stems from their ability to conduct comprehensive analyses, which is not easily enhanced through techniques such as chain-of-thought reasoning. By reframing the evaluation task as a multi-faceted Q&A using an instance-specific checklist, we demonstrate that the limited judgment accuracy of lightweight LLMs is largely attributable to high uncertainty and positional bias. To address these challenges, we introduce an automated evaluation process grounded in checklist grading, which is designed to accommodate a variety of scenarios and questions. This process encompasses the creation of checklists, the grading of these checklists by lightweight LLMs, and the reweighting of checklist items to align with the supervised annotations. Our experiments on the automated evaluation benchmarks MT-Bench and WildBench show that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, which is comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold for large-scale evaluation and comparison scenarios. Our code is available at https://github.com/Joinn99/RocketEval-ICLR .
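
The checklist-grading pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the yes/no renormalization, the simple mean over items, and the direction of the alpha blend are all assumptions made for illustration, since the exact formulas are not given in this summary.

```python
from math import exp
from typing import Sequence


def normalized_yes_prob(logit_yes: float, logit_no: float) -> float:
    """Assumed scoring rule: probability of "yes" renormalized over
    only the "yes" and "no" answers for one checklist item."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = exp(logit_yes - m)
    e_no = exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def unsupervised_score(item_probs: Sequence[float]) -> float:
    """Assumed unsupervised prediction: plain mean of per-item scores."""
    if not item_probs:
        raise ValueError("checklist must contain at least one item")
    return sum(item_probs) / len(item_probs)


def blended_score(unsupervised: float, supervised: float, alpha: float) -> float:
    """Assumed alpha-weighted blend: alpha = 0 keeps the unsupervised
    score, alpha = 1 keeps the supervised (annotation-fitted) score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * supervised + (1.0 - alpha) * unsupervised


# Hypothetical judge logits (yes, no) for a three-item checklist.
logits = [(2.0, -1.0), (0.0, 0.0), (-1.5, 1.5)]
probs = [normalized_yes_prob(y, n) for y, n in logits]
score = unsupervised_score(probs)
print(round(blended_score(score, supervised=0.6, alpha=0.5), 3))
```

Because each item is scored from its own judgment, an item's score in this sketch does not depend on its position in the checklist, which is the property the independent-judgment design is meant to provide.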

Paper Structure

This paper contains 23 sections, 4 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Agreements with MT-Bench Human Judgments with different LLM judges. "CoT" indicates the judgments derived using the original Chain-of-thought (CoT) prompting, and "Ours" indicates the judgments derived using our proposed RocketEval framework.
  • Figure 2: Illustration of RocketEval framework for automated LLM evaluation. The framework consists of three components: Checklist Creation, Checklist Grading and Score Prediction.
  • Figure 3: WildBench scores predicted by different LLM judges and the ranking correlation with GPT-4o.
  • Figure 4: Ratio of disagreement on WildBench with different numbers of sampling times.
  • Figure 5: Ratio of disagreement on WildBench with checklist items in different positions.
  • ...and 12 more figures