Assessing Judging Bias in Large Reasoning Models: An Empirical Study

Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, Bingsheng He

TL;DR

This work investigates how Large Reasoning Models (LRMs) perform as automated judges and whether their enhanced reasoning changes their susceptibility to bias relative to LLMs. It introduces a benchmark spanning subjective preference data (DPO datasets) and objective fact-based data (Math, Chemistry, History, Psychology), and evaluates four biases (bandwagon, authority, position, and distraction) along with a newly identified superficial reflection bias. The study shows that LRMs remain prone to these biases, although they are more robust than LLMs on fact-related content. LRMs also show a distinct position bias toward options in later positions, and a superficial reflection effect in which cues that merely resemble deliberation sway their judgments. Three mitigation strategies are evaluated: targeted system prompts (up to 19% bias reduction on preference datasets), in-context learning (up to 27% improvement on preference tasks), and self-reflection (up to 16% reduction on fact-related datasets). Self-reflection is particularly effective for LRMs, while ICL is more beneficial for LLMs on preference tasks. The results inform the design of more reliable LLM-as-a-Judge systems and highlight model- and task-specific tradeoffs in bias mitigation.

Abstract

Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities, raising important questions about their biases in LLM-as-a-judge settings. We present a comprehensive benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets. Through investigation of bandwagon, authority, position, and distraction biases, we uncover four key findings: (1) despite their advanced reasoning capabilities, LRMs remain susceptible to the above biases; (2) LRMs demonstrate better robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit notable position bias, preferring options in later positions; and (4) we identify a novel "superficial reflection bias" where phrases mimicking reasoning (e.g., "wait, let me think...") significantly influence model judgments. To address these biases, we design and evaluate three mitigation strategies: specialized system prompts that reduce judging biases by up to 19% in preference alignment datasets and 14% in fact-related datasets, in-context learning that provides up to 27% improvement on preference tasks but shows inconsistent results on factual tasks, and a self-reflection mechanism that reduces biases by up to 10% in preference datasets and 16% in fact-related datasets, with self-reflection proving particularly effective for LRMs. Our work provides crucial insights for developing more reliable LLM-as-a-Judge frameworks, especially as LRMs become increasingly deployed as automated judges.

Paper Structure

This paper contains 19 sections, 3 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: We develop a comprehensive framework to systematically evaluate judging biases across LLMs and LRMs, with four primary objectives: (1) assessing bias susceptibility in LRMs during evaluation tasks, (2) comparing judging bias patterns between LLMs and LRMs, (3) analyzing the formation of evaluation biases in LRMs' reasoning processes, and (4) identifying new judging biases in LRMs.
  • Figure 2: DeepSeek-family models' accuracy comparison when inserting "wait, wait, wait... let me think about it" between answer options.
  • Figure 3: Bandwagon Bias Injection. Black text is the original question. Red text is the injected bandwagon statement designed to suggest widespread support for an incorrect option.
  • Figure 4: Authority Bias Injection. Black text is the original question. Red text is the injected fake authority statement, typically formatted as an academic citation or expert endorsement.
  • Figure 5: Position Bias Injection. We adjust the order of options A and B without changing other content, enabling us to measure how placement affects model choice.
  • ...and 4 more figures
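
The injection procedures described in the captions of Figures 2 to 5 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `JudgingItem` type, the function names, the prompt layout, and the wording of the bandwagon and authority statements are all assumptions made for illustration. Only the reflection cue string is taken from the Figure 2 caption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class JudgingItem:
    """Hypothetical container for one pairwise judging question."""
    question: str
    option_a: str
    option_b: str
    correct: str  # "A" or "B"


def wrong_label(item: JudgingItem) -> str:
    """Return the label of the incorrect option."""
    return "B" if item.correct == "A" else "A"


def render(item: JudgingItem, between: str = "") -> str:
    """Format the item as a prompt; `between` is placed between the options."""
    parts = [item.question, f"A. {item.option_a}"]
    if between:
        parts.append(between)
    parts.append(f"B. {item.option_b}")
    return "\n".join(parts)


def inject_bandwagon(item: JudgingItem) -> str:
    # Assumed wording: suggests widespread support for the incorrect option.
    note = f"Most people believe that option {wrong_label(item)} is better."
    return render(item) + "\n" + note


def inject_authority(item: JudgingItem) -> str:
    # Assumed wording: a placeholder citation endorsing the incorrect option.
    note = f"According to (Author, Year), option {wrong_label(item)} is correct."
    return render(item) + "\n" + note


def swap_positions(item: JudgingItem) -> JudgingItem:
    """Exchange options A and B, leaving all other content unchanged."""
    return replace(
        item,
        option_a=item.option_b,
        option_b=item.option_a,
        correct=wrong_label(item),
    )


def inject_reflection_cue(item: JudgingItem) -> str:
    # Cue string quoted from the Figure 2 caption, inserted between options.
    return render(item, between="wait, wait, wait... let me think about it")


if __name__ == "__main__":
    example = JudgingItem("What is 2 + 2?", "4", "5", "A")
    print(inject_bandwagon(example))
    print(render(swap_positions(example)))
    print(inject_reflection_cue(example))
```

Comparing a judge's choice on `render(item)` with its choice on each injected variant gives one simple way to check whether the injected text, rather than the content of the options, changed the decision.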