No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding

Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner

TL;DR

This work critically examines LLMs used as judges of text quality and shows that a judge's performance hinges on its own ability to answer the question and on the quality of the references it is given. The authors introduce two datasets, BFF-Bench and a human-annotated correctness dataset of 1,200 LLM responses, to study how reference quality affects judgments across multiple models and tasks. Key findings show that providing correct human-written references substantially improves alignment with human judgments, especially on hard questions, and that a weaker judge with higher-quality references can reach better agreement with human annotators than a stronger judge with synthetic references. The paper advocates verifying references as a practical necessity in evaluation pipelines and discusses biases such as self-preference, offering guidance for more reliable and domain-aware assessments of frontier models.

Abstract

LLM-as-a-Judge is a framework that uses an LLM (large language model) to evaluate the quality of natural language text, typically text that is also generated by an LLM. This framework holds great promise due to its relatively low cost, ease of use, and strong correlations with human stylistic preferences. However, LLM Judges have been shown to exhibit biases that can distort their judgments. We evaluate how well LLM Judges can grade whether a given response to a conversational question is correct, an ability crucial to soundly estimating the overall response quality. To do so, we create and publicly release a human-annotated dataset with labels of correctness for 1,200 LLM responses. We source questions from a combination of existing datasets and a novel, challenging benchmark (BFF-Bench) created for this analysis. We demonstrate a strong connection between an LLM's ability to correctly answer a question and its ability to grade responses to that question. Although aggregate-level statistics might imply that a judge has high agreement with human annotators, it will struggle on the subset of questions it could not answer. To address this issue, we recommend a simple solution: provide the judge with a correct, human-written reference answer. We perform an in-depth analysis of how reference quality can affect the performance of an LLM Judge. We show that providing a weaker judge (e.g., Qwen 2.5 7B) with higher-quality references reaches better agreement with human annotators than a stronger judge (e.g., GPT-4o) with synthetic references.
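
The gap between aggregate agreement and agreement on the questions a judge could not answer itself can be illustrated with a small sketch. This is not the authors' evaluation code: the record fields (`human_label`, `judge_label`, `judge_answered_correctly`), the function name, and the toy data are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from collections import defaultdict


def agreement_by_judge_ability(records):
    """Return judge-human agreement overall and split by judge ability.

    Each record is a dict with three hypothetical boolean fields:
      human_label              - the human correctness label for a response
      judge_label              - the LLM judge's correctness label
      judge_answered_correctly - whether the judge model itself answered
                                 the underlying question correctly
    """
    totals = defaultdict(lambda: [0, 0])  # subset name -> [matches, count]
    for r in records:
        match = r["judge_label"] == r["human_label"]
        subset = (
            "judge_could_answer"
            if r["judge_answered_correctly"]
            else "judge_could_not_answer"
        )
        for key in ("overall", subset):
            totals[key][0] += match
            totals[key][1] += 1
    return {key: matches / count for key, (matches, count) in totals.items()}


# Toy data (invented): the judge agrees with humans on all 8 responses to
# questions it could answer, but on only 1 of 2 for questions it could not.
records = [
    {"human_label": True, "judge_label": True, "judge_answered_correctly": True}
    for _ in range(8)
] + [
    {"human_label": True, "judge_label": True, "judge_answered_correctly": False},
    {"human_label": False, "judge_label": True, "judge_answered_correctly": False},
]

result = agreement_by_judge_ability(records)
```

On this toy data the overall agreement looks high while the agreement on the "could not answer" subset is at chance, which is the pattern the abstract warns that aggregate statistics can hide.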

Paper Structure

This paper contains 29 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: LLM Judge agreement with human annotators for the two strongest judges on the pairwise correctness judgment task. "None" means having no reference, "Self" means having a reference generated by the judge model, and "Human" means having a human-written gold reference.
  • Figure 2: Example of an incorrect reference included in MT-Bench along with our corrected reference. Error and correction are highlighted in bold.
  • Figure 3: Truncated example of a question from BFF-Bench. For each question, we include a human-written gold answer that contains the final answer and complete chain-of-thought reasoning. The untruncated version of this example is available in the appendix.
  • Figure 4: Error analysis for the judge models conditioned on the grading references. Self refers to the cases in which the judge model was judging responses from itself. Others refers to when it was judging responses from other models.
  • Figure 5: The full graph displaying the error rates for different LLM Judges and different reference types. The same trend displayed in the affinity-bias figure (fig:affinity_bias) holds for each model. Rand. refers to the Random reference type.
  • ...and 1 more figure