Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations

Suryaansh Jain, Umair Z. Ahmed, Shubham Sahai, Ben Leong

TL;DR

The paper examines how reliable LLMs are as judges of open-ended outputs and quantifies an agreeableness bias: validators achieve a high True Positive Rate (TPR) but a low True Negative Rate (TNR). It introduces a code-feedback benchmark of 366 high-school Python programs, demonstrates the limitations of majority voting, and proposes two remedies: a minority-veto ensemble that is resilient to missing data, and a regression-based calibration framework that uses a small human-annotated calibration set. The regression method achieves a maximum absolute error of 1.2% on held-out data, a 2x improvement over the best-performing ensemble of 14 LLMs, enabling scalable, dependable automated benchmarking. The work provides datasets and methods applicable to other subjective benchmarks.

Abstract

New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of deciding whether they should switch to a new model. While human evaluation remains the gold standard, it is costly and unscalable. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., True Positive Rate 96%), they are remarkably poor at identifying invalid ones (i.e., True Negative Rate <25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods like majority voting can help, we show that they are not good enough. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, achieving a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.
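
To make the abstract's two central observations concrete, the following is a minimal Python sketch, not the authors' implementation. It first shows how a validator with high TPR and low TNR can still report a high raw accuracy when most outputs are valid, and then contrasts majority voting with a minority-veto rule when some votes are missing. The function names, the 80% share of valid outputs, the example vote pattern, and the veto threshold of 2 are all illustrative assumptions; the paper's optimal threshold is not stated in this section.

```python
from typing import Optional, Sequence


def raw_and_balanced_accuracy(tpr: float, tnr: float, valid_fraction: float):
    """Return (raw accuracy, balanced accuracy) for a single validator.

    valid_fraction is the share of outputs that are truly valid
    (an assumed value here, not a figure from the paper).
    """
    raw = tpr * valid_fraction + tnr * (1.0 - valid_fraction)
    balanced = (tpr + tnr) / 2.0
    return raw, balanced


def majority_vote(votes: Sequence[Optional[bool]]) -> bool:
    """Label 'valid' only if a strict majority of the cast votes say valid.

    None marks a missing vote and is ignored.
    """
    cast = [v for v in votes if v is not None]
    return 2 * sum(cast) > len(cast)


def minority_veto(votes: Sequence[Optional[bool]], veto_threshold: int = 2) -> bool:
    """Label 'valid' unless at least veto_threshold validators vote invalid.

    Missing votes (None) never count as a veto, so the rule needs no
    imputation. The default threshold is an illustrative assumption.
    """
    invalid_votes = sum(1 for v in votes if v is False)
    return invalid_votes < veto_threshold


if __name__ == "__main__":
    # Class imbalance hides the weak TNR behind a high raw accuracy.
    raw, balanced = raw_and_balanced_accuracy(tpr=0.96, tnr=0.25, valid_fraction=0.8)
    print(f"raw accuracy = {raw:.3f}, balanced accuracy = {balanced:.3f}")

    # 14 validators: 10 say valid, 3 say invalid, 1 vote is missing.
    votes = [True] * 10 + [False] * 3 + [None]
    print("majority vote says valid:", majority_vote(votes))
    print("minority veto says valid:", minority_veto(votes))
```

Because agreeable validators rarely vote "invalid", a few dissenting votes carry more information than the many agreeing ones. That asymmetry is the intuition a veto rule exploits, and a majority rule discards it.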

Paper Structure

This paper contains 25 sections, 12 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Correlation between LLM Elo ratings (Chiang et al., 2024) and their performance as generators and validators on high-school programming feedback.
  • Figure 2: Workflow when using LLM to evaluate other LLM-generated feedback.
  • Figure 3: Predicted precision of generators by 14 validators.
  • Figure 4: Agreeableness bias in LLM validators: high TPR ($v^+_j \geq 96\%$) but low TNR ($v^-_j \leq 25\%$).
  • Figure 5: Valid voting strategies, which assign the "valid" label based on a threshold, mirror invalid voting after fixing missing values.
  • ...and 3 more figures