Accelerating Unbiased LLM Evaluation via Synthetic Feedback

Zhaoyi Zhou, Yuda Song, Andrea Zanette

TL;DR

This paper tackles the high cost of human judgments in evaluating LLM head-to-head win rates. It proposes Control Variates Evaluation, a principled variance-reduction framework that combines human judgments with synthetic feedback from an automatic evaluator, preserving unbiasedness while reducing annotation needs. Theory shows that the variance shrinks to $(1-\rho^2)$ times that of human-only evaluation, where $\rho$ is the correlation between human and synthetic feedback. Empirically, the method saves up to 12.2% of human annotations with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned one, on benchmarks such as Chatbot Arena and MT-Bench. The work also offers a data-dependent metric, the human annotation saving ratio, for predicting the savings achievable on future tasks.

Abstract

When developing new large language models (LLMs), a key step is evaluating their final performance, often by computing the win-rate against a reference model based on external feedback. Human feedback is the gold standard, particularly for capturing nuanced qualities like coherence, readability, and alignment with human expectations. However, human evaluations are costly -- even for large tech companies -- and when conducted with active users, they may negatively impact user experience. A promising alternative is synthetic feedback, where evaluations are conducted by other large language models, including reward models. While this eliminates the need for costly human annotations, it introduces biases that may distort the evaluation process. In this work, we propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations while maintaining unbiased win-rate calculations. Our experiments demonstrate a reduction in human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. Apart from being generalizable, scalable, and free of hyper-parameter tuning, our method offers predictable annotation savings, which can be estimated based on data-dependent characteristics.
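The idea of combining human and synthetic feedback can be illustrated with a minimal sketch of a generic control variates estimator. This is not the authors' implementation: the toy data, the function names `control_variates_estimate` and `simulate`, and the choice to estimate the coefficient `alpha` from the annotated samples are all assumptions made for illustration. With a plug-in `alpha`, a small finite-sample bias can appear, which the simulation suggests is negligible at this sample size.

```python
import numpy as np


def control_variates_estimate(human, synthetic, synthetic_mean):
    """Return (estimate, alpha) for the mean of `human`.

    human, synthetic: paired scores on the same annotated samples.
    synthetic_mean: mean synthetic score over the full evaluation pool,
        which is cheap to compute because it needs no human labels.
    """
    human = np.asarray(human, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    var_s = synthetic.var(ddof=1)
    if var_s == 0.0:
        # A constant synthetic score carries no information; fall back.
        return human.mean(), 0.0
    # Plug-in estimate of the variance-minimizing coefficient Cov(h, s) / Var(s).
    alpha = np.cov(human, synthetic, ddof=1)[0, 1] / var_s
    estimate = human.mean() - alpha * (synthetic.mean() - synthetic_mean)
    return estimate, alpha


def simulate(pool_size=20_000, n_human=200, n_trials=2_000, seed=0):
    """Compare human-only and control variates estimates on toy data."""
    rng = np.random.default_rng(seed)
    # Hypothetical data: a latent win probability per prompt.
    quality = rng.uniform(0.0, 1.0, size=pool_size)
    # Binary human preference drawn from the latent probability.
    human = (rng.uniform(size=pool_size) < quality).astype(float)
    # Synthetic score: correlated with the human label but shifted, hence biased.
    synthetic = quality + 0.15 + rng.normal(0.0, 0.1, size=pool_size)

    truth = human.mean()  # target win rate on this pool
    synthetic_mean = synthetic.mean()
    rho = np.corrcoef(human, synthetic)[0, 1]

    human_only, cv = [], []
    for _ in range(n_trials):
        # Each trial annotates a fresh random subset of the pool.
        idx = rng.choice(pool_size, size=n_human, replace=False)
        human_only.append(human[idx].mean())
        cv.append(
            control_variates_estimate(human[idx], synthetic[idx], synthetic_mean)[0]
        )

    human_only, cv = np.array(human_only), np.array(cv)
    return {
        "truth": truth,
        "synthetic_only": synthetic_mean,
        "cv_mean": cv.mean(),
        "variance_ratio": cv.var() / human_only.var(),
        "one_minus_rho_sq": 1.0 - rho**2,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.4f}")
```

In this toy setup the synthetic-only mean is far from the target, while the control variates estimate stays centered on it and its variance relative to human-only evaluation should land near $1-\rho^2$.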

Paper Structure

This paper contains 44 sections, 1 theorem, 13 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Proposition 4.1

Suppose the expectations, variances, covariances and correlation coefficients, unless otherwise stated, are taken under the distribution $x \sim \mathrm{Uniform}(\mathcal{X})$, $y^1 \sim \ell^1(\cdot \mid x)$, $y^2 \sim \ell^2(\cdot \mid x)$. Then the control variates estimate $z^{\mathsf{cv}; \alph

Figures (9)

  • Figure 1: (Left) Illustration of Control Variates Evaluation, which uses a possibly inaccurate synthetic evaluator to reduce the variance of evaluation, lowering the need for human annotations while preserving unbiasedness. (Right) Averaged mean-square error vs. number of human annotations for Human Evaluation, Synthetic Evaluation, and Control Variates Evaluation using the finetuned Skywork-8B evaluator on Chatbot Arena. Synthetic Evaluation has high bias, while the bias of Human and Control Variates Evaluation is negligible. Control Variates Evaluation reduces the variance of Human Evaluation.
  • Figure 2: OpenAI's prompting users for feedback; excessive requests may negatively impact user experience.
  • Figure 3: Averaged mean-square error versus number of human annotations for Skywork-8B (pretrained and finetuned) on Chatbot Arena [zheng2023judging]. The $x$-coordinates of the curves "Human" and "Control Variates" correspond to the number of human annotations. The curve "Human (shifted)" is derived by horizontally scaling the Human Evaluation curve by $(1-s)$, where $s$ is the averaged human annotation saving ratio reported in the results table (tab:result_save). The averaged mean-square error of Control Variates Evaluation converges to nearly 0, indicating that it has negligible bias. The human annotation saving ratio aligns perfectly with the actual variance relationship between Human Evaluation and Control Variates Evaluation.
  • Figure 4: Averaged human annotation saving ratio before and after finetuning for GRM-2B and Skywork-8B on Chatbot Arena and MT-Bench. Under all setups, we observe an increase of at least 5% in the saving ratio.
  • Figure 5: Averaged mean-square error versus number of human annotations for the GPT-4 evaluator on Chatbot Arena [zheng2023judging]. Note that even GPT-4 has high bias when used alone for Synthetic Evaluation.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Proposition 4.1: Control Variates Properties [lavenberg1981perspective]
  • Definition 4.2: Human annotation saving ratio
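
The $(1-\rho^2)$ variance factor associated with Proposition 4.1 follows from the textbook control variates argument, sketched below. The notation is assumed rather than taken from the paper: $\bar z_h$, $\bar z_s$, $\mu_s$, $\sigma_h$, $\sigma_s$, and $n$ are introduced here for illustration, and only $z^{\mathsf{cv};\alpha}$ mirrors the proposition. Reading the saving ratio as $\rho^2$ is likewise an inference from the variance factor and the $(1-s)$ scaling in Figure 3, not a quoted definition.

```latex
% Assumed notation: \bar z_h and \bar z_s are sample means of the human and
% synthetic scores on the same n i.i.d. samples; \mu_s is the synthetic mean,
% treated as known; \sigma_h^2, \sigma_s^2 are per-sample variances.
\begin{align*}
z^{\mathsf{cv};\alpha} &= \bar z_h - \alpha\,(\bar z_s - \mu_s), \\
\mathbb{E}\big[z^{\mathsf{cv};\alpha}\big] &= \mathbb{E}[\bar z_h]
  \quad \text{for every fixed } \alpha, \text{ since } \mathbb{E}[\bar z_s] = \mu_s, \\
\mathrm{Var}\big(z^{\mathsf{cv};\alpha}\big)
  &= \frac{1}{n}\Big(\sigma_h^2 - 2\alpha\,\rho\,\sigma_h\sigma_s + \alpha^2\sigma_s^2\Big), \\
\alpha^\star &= \frac{\rho\,\sigma_h}{\sigma_s}
  \;\Longrightarrow\;
  \mathrm{Var}\big(z^{\mathsf{cv};\alpha^\star}\big) = \frac{\sigma_h^2}{n}\,(1-\rho^2).
\end{align*}
```

Matching the variance $\sigma_h^2/n$ of human-only evaluation would therefore take about $(1-\rho^2)\,n$ annotations, a saving of roughly $\rho^2$ under this reading. For instance, $\rho = 0.5$ would give a 25% saving.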