Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang

TL;DR

The paper investigates how auxiliary information used to guide LLM-based evaluation—such as reference answers and rubrics—can bias judgments in complex tasks. It introduces ComplexEval, a two-tier benchmark that exposes six previously unidentified biases by applying adversarial attack strategies to both basic and advanced evaluation scenarios. The results show that all models suffer from these biases, with task complexity amplifying the effect and reasoning models paradoxically more vulnerable. The work highlights the need for robust, interpretable evaluation signals and provides a framework for developing more reliable LLM judges for diverse, nuanced tasks.

Abstract

As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. The reliability of LLM judges in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.

Paper Structure

This paper contains 45 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: "When Help Becomes Harm": The figure shows that auxiliary information improves LLM judges but introduces new biases in complex evaluation
  • Figure 2: The diagram illustrates the two core components of our dataset: 1) ComplexEval-Basic - Focused on 12 generic scenarios, employing Comprehensive Attack for macro-level bias exploration. 2) ComplexEval-Advanced - Centered on 3 more complex scenarios, utilizing Targeted Attack for granular principle analysis
  • Figure 3: Cross-domain comparison of auxiliary information induced bias effects
  • Figure 4: Robustness variation between reasoning and general models with increasing domain complexity
  • Figure 5: Evaluation of issue detection accuracy against ground truth (y=x). Left: Multi-dimensional evaluation shows a hard ceiling ( 15 issues) despite increasing actual issues; Right: Single-dimension evaluation removes the ceiling but introduces a floor (1-2 false detections).
  • ...and 1 more figures