Accelerating Unbiased LLM Evaluation via Synthetic Feedback

Zhaoyi Zhou; Yuda Song; Andrea Zanette

TL;DR

This paper tackles the high cost of human judgments when estimating head-to-head win rates between LLMs. It proposes Control Variates Evaluation, a principled variance-reduction framework that combines human judgments with synthetic feedback from an automatic evaluator, keeping the win-rate estimate unbiased while requiring fewer human annotations. In theory, the estimator's variance shrinks to $(1-\rho^2)$ times that of the human-only estimate, where $\rho$ is the correlation between human and synthetic judgments. Empirically, the method reduces human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned one, on benchmarks such as Chatbot Arena and MT-Bench. The approach is practical and generalizable, and it comes with a data-dependent metric, the human annotation saving ratio, for predicting the savings on future tasks.
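
The $(1-\rho^2)$ factor matches the textbook control variates identity. The following is a sketch in notation assumed here rather than taken from the paper: $H_i$ is the human preference on prompt $i$, $S_i$ is the synthetic evaluator's score on the same prompt, $\mu_S = \mathbb{E}[S]$ is treated as known (in practice it could be estimated cheaply on a large unlabeled pool), and $\alpha$ is a fixed coefficient.

```latex
\begin{align*}
\hat{\mu}_{\mathrm{CV}}
  &= \frac{1}{n}\sum_{i=1}^{n}\bigl(H_i - \alpha\,(S_i - \mu_S)\bigr),
  &
\mathbb{E}\bigl[\hat{\mu}_{\mathrm{CV}}\bigr]
  &= \mathbb{E}[H] \quad \text{for any fixed } \alpha, \\
\alpha^{*}
  &= \frac{\operatorname{Cov}(H,S)}{\operatorname{Var}(S)},
  &
\operatorname{Var}\bigl(\hat{\mu}_{\mathrm{CV}}\bigr)\Big|_{\alpha=\alpha^{*}}
  &= \bigl(1-\rho^{2}\bigr)\,\frac{\operatorname{Var}(H)}{n}.
\end{align*}
```

Under these assumptions, matching the variance of a human-only average over $n$ annotations takes roughly $(1-\rho^2)\,n$ annotations, which is why a higher human-synthetic correlation translates into larger savings.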

Abstract

When developing new large language models (LLMs), a key step is evaluating their final performance, often by computing the win-rate against a reference model based on external feedback. Human feedback is the gold standard, particularly for capturing nuanced qualities like coherence, readability, and alignment with human expectations. However, human evaluations are costly -- even for large tech companies -- and when conducted with active users, they may negatively impact user experience. A promising alternative is synthetic feedback, where evaluations are conducted by other large language models, including reward models. While this eliminates the need for costly human annotations, it introduces biases that may distort the evaluation process. In this work, we propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations while maintaining unbiased win-rate calculations. Our experiments demonstrate a reduction in human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. Apart from being generalizable, scalable, and free of hyper-parameter tuning, our method offers predictable annotation savings, which can be estimated based on data-dependent characteristics.
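
To make the idea concrete, here is a minimal Python sketch of a control variates win-rate estimate. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 1/0 encoding of human preferences, and the use of $\rho^2$ as the predicted saving ratio are all hypothetical choices made for this example. The coefficient is fitted on the labeled samples for simplicity.

```python
import random


def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def control_variates_win_rate(human, synthetic, synthetic_mean):
    """Win-rate estimate that corrects the human average with synthetic scores.

    human:          human preferences on the n labeled prompts (assumed 1 = win, 0 = loss)
    synthetic:      synthetic evaluator scores on the same n prompts
    synthetic_mean: mean synthetic score over a large pool (needs no human labels)
    """
    var_s = covariance(synthetic, synthetic)
    if var_s == 0:
        # A constant synthetic score carries no information; fall back to humans only.
        return mean(human)
    # Coefficient fitted on the labeled samples (a simplification in this sketch).
    alpha = covariance(human, synthetic) / var_s
    return mean(human) - alpha * (mean(synthetic) - synthetic_mean)


def saving_ratio(human, synthetic):
    """Squared human-synthetic correlation, used here as the predicted saving (assumption)."""
    var_h = covariance(human, human)
    var_s = covariance(synthetic, synthetic)
    if var_h == 0 or var_s == 0:
        return 0.0
    return covariance(human, synthetic) ** 2 / (var_h * var_s)


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated data: the synthetic score is a noisy proxy for the human preference.
    pool_scores = [rng.random() for _ in range(20000)]
    pool_human = [1.0 if rng.random() < s else 0.0 for s in pool_scores]
    n = 500  # only the first n prompts receive human labels
    estimate = control_variates_win_rate(pool_human[:n], pool_scores[:n], mean(pool_scores))
    print("human-only estimate:      ", mean(pool_human[:n]))
    print("control variates estimate:", estimate)
    print("predicted saving ratio:   ", saving_ratio(pool_human[:n], pool_scores[:n]))
```

Only the first `n` prompts need human labels; the synthetic evaluator is run on the whole pool, and its pool-wide mean anchors the correction term.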
