No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding

Michael Krumdick; Charles Lovering; Varshini Reddy; Seth Ebner; Chris Tanner

TL;DR

This work critically examines LLMs used as judges of text quality, showing that a judge's performance depends on its own ability to answer the question and on the quality of the references it is given. The authors introduce two datasets, BFF-Bench and a Correctness Dataset of 1,200 answers, to study how reference quality affects judgments across multiple models and tasks. Key findings show that providing correct human-written references substantially improves alignment with human judgments, especially on hard questions, and that improving reference quality can outperform simply deploying a larger judge with synthetic references. The paper argues that verifying references is a practical necessity in evaluation pipelines and discusses biases such as self-preference, offering guidance for more reliable, domain-aware assessment of frontier models.

Abstract

LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text, typically text that is itself generated by an LLM. The framework holds great promise due to its relatively low cost, ease of use, and strong correlation with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating overall response quality. To do so, we create and publicly release a human-annotated dataset with correctness labels for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and its ability to grade responses to that question. Although aggregate-level statistics might imply that a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis of how reference quality affects the performance of an LLM Judge. We show that a weaker judge (e.g., Qwen 2.5 7B) given higher-quality references reaches better agreement with human annotators than a stronger judge (e.g., GPT-4o) given synthetic references.
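
The gap between aggregate agreement and agreement on questions the judge could not answer can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Record` fields, the function names, the use of raw agreement rate as the metric (the abstract does not name one), and the toy data, which is invented and is not drawn from the released dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """One graded response (hypothetical schema, not the paper's)."""
    human_label: bool     # human annotator: is the response correct?
    judge_label: bool     # LLM judge's verdict on the same response
    judge_answered: bool  # did the judge model answer the question correctly itself?


def agreement(records: list[Record]) -> Optional[float]:
    """Fraction of records where judge and human agree; None if the list is empty."""
    if not records:
        return None
    matches = sum(r.judge_label == r.human_label for r in records)
    return matches / len(records)


def agreement_by_judge_ability(records: list[Record]) -> dict[str, Optional[float]]:
    """Report agreement overall and split by whether the judge could answer."""
    could = [r for r in records if r.judge_answered]
    could_not = [r for r in records if not r.judge_answered]
    return {
        "overall": agreement(records),
        "judge_could_answer": agreement(could),
        "judge_could_not_answer": agreement(could_not),
    }


# Invented toy data: six responses to questions the judge answered correctly,
# two responses to questions it got wrong.
records = [
    Record(True, True, True),
    Record(False, False, True),
    Record(True, True, True),
    Record(False, False, True),
    Record(True, True, True),
    Record(False, False, True),
    Record(False, True, False),  # judge accepts an incorrect response
    Record(True, True, False),
]

result = agreement_by_judge_ability(records)
for name, value in result.items():
    print(f"{name}: {value}")
```

On this toy data the overall rate looks high while the rate on the subset the judge could not answer is much lower, which is the pattern the abstract warns an aggregate statistic can hide. A chance-corrected metric could be substituted for raw agreement without changing the structure of the split.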
