Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations

Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush

TL;DR

The paper studies label-induced bias in LLM-based evaluation through a controlled, multi-model experiment in which three LLMs (ChatGPT, Gemini, and Claude) evaluate each other's blog posts under four labeling conditions. It combines holistic preference voting with granular quality ratings for Coherence, Informativeness, and Conciseness to quantify the biases and test whether they hold across evaluation formats. The "Claude" label elevates scores while the "Gemini" label depresses them; false labeling can reverse preference rankings, shifting voting outcomes by up to 50 percentage points, and Informativeness emerges as the most label-sensitive dimension. The work highlights the need for blind, multi-model, and bias-aware evaluation frameworks to improve fairness and validity in automated benchmarking of text quality.

Abstract

Large language models (LLMs) are increasingly deployed as evaluators of text quality, yet the validity of their judgments remains underexplored. This study investigates systematic bias in self- and cross-model evaluations across three prominent LLMs: ChatGPT, Gemini, and Claude. We designed a controlled experiment in which blog posts authored by each model were evaluated by all three models under four labeling conditions: no attribution, true attribution, and two false-attribution scenarios. Evaluations employed both holistic preference voting and granular quality ratings across three dimensions (Coherence, Informativeness, and Conciseness), with all scores normalized to percentages for direct comparison. Our findings reveal pronounced asymmetries in model judgments: the "Claude" label consistently elevated scores regardless of actual authorship, while the "Gemini" label systematically depressed them. False attribution frequently reversed preference rankings, producing shifts of up to 50 percentage points in voting outcomes and up to 12 percentage points in quality ratings. Notably, Gemini exhibited severe self-deprecation under true labels, while Claude demonstrated intensified self-preference. These results demonstrate that perceived model identity can substantially distort both high-level judgments and fine-grained quality assessments, independent of content quality. Our findings challenge the reliability of LLM-as-judge paradigms and underscore the critical need for blind evaluation protocols and diverse multi-model validation frameworks to ensure fairness and validity in automated text evaluation and LLM benchmarking.
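The four labeling conditions and the percentage normalization described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' code. The first false-label mapping follows the Figure 1-style caption for False Label Scenario 1 (ChatGPT-as-Gemini, Gemini-as-Claude, Claude-as-ChatGPT); the second mapping is an assumption (the remaining rotation), and the function names, condition keys, and the `max_points` parameter are hypothetical, since the rating scale is not stated here.

```python
from typing import Optional

MODELS = ["ChatGPT", "Gemini", "Claude"]

# Scenario 1 follows the caption "ChatGPT-as-Gemini, Gemini-as-Claude,
# Claude-as-ChatGPT". Scenario 2 is ASSUMED to be the other rotation.
FALSE_LABELS_1 = {"ChatGPT": "Gemini", "Gemini": "Claude", "Claude": "ChatGPT"}
FALSE_LABELS_2 = {"ChatGPT": "Claude", "Gemini": "ChatGPT", "Claude": "Gemini"}


def shown_label(author: str, condition: str) -> Optional[str]:
    """Return the attribution an evaluator sees for a post by `author`."""
    if condition == "no_label":
        return None  # blind evaluation: no attribution shown
    if condition == "true_label":
        return author
    if condition == "false_1":
        return FALSE_LABELS_1[author]
    if condition == "false_2":
        return FALSE_LABELS_2[author]
    raise ValueError(f"unknown condition: {condition}")


def to_percent(points: float, max_points: float) -> float:
    """Normalize a raw score to a percentage (max_points is hypothetical)."""
    return 100.0 * points / max_points


def shift_pp(baseline_pct: float, labeled_pct: float) -> float:
    """Label-induced shift in percentage points relative to a baseline."""
    return labeled_pct - baseline_pct


if __name__ == "__main__":
    for author in MODELS:
        for cond in ("no_label", "true_label", "false_1", "false_2"):
            print(author, cond, shown_label(author, cond))
```

Comparing each labeled condition against the no-label baseline in percentage points is one way to express shifts like those reported above; the paper's exact baseline choice may differ.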

Paper Structure

This paper contains 6 sections, 11 figures.

Figures (11)

  • Figure 1: Percentage-based overall scores under the No Label condition.
  • Figure 2: Percentage-based overall scores under the True Label condition.
  • Figure 3: Percentage-based overall scores under False Label Scenario 1 (ChatGPT-as-Gemini, Gemini-as-Claude, Claude-as-ChatGPT).
  • Figure 4: Point-based scores under the No Label condition.
  • Figure 5: Point-based scores under the True Label condition.
  • ...and 6 more figures