Assessing Judging Bias in Large Reasoning Models: An Empirical Study
Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, Bingsheng He
TL;DR
This work investigates how Large Reasoning Models (LRMs) perform as automated judges and whether their enhanced reasoning changes their susceptibility to bias relative to LLMs. It introduces a comprehensive benchmark spanning subjective preference data (DPO datasets) and objective fact data (Math, Chemistry, History, Psychology), and evaluates four biases (bandwagon, authority, position, and distraction) together with a novel superficial reflection bias. The study shows that LRMs remain prone to these biases, although they are more robust than LLMs on fact-related content. LRMs also show a distinct position bias that favors later options, and a superficial reflection effect in which cues resembling deliberation sway their judgments. Three mitigation strategies are evaluated: targeted system prompts (up to 19% bias reduction on preference datasets), in-context learning (up to 27% improvement on preference tasks), and self-reflection (up to 16% bias reduction on fact tasks). Self-reflection is particularly effective for LRMs, while ICL is more beneficial for LLMs on preference tasks. The results inform the design of more reliable LLM-as-a-Judge systems and highlight model- and task-specific tradeoffs in bias mitigation.
Abstract
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities, raising important questions about their biases in LLM-as-a-Judge settings. We present a comprehensive benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets. Through investigation of bandwagon, authority, position, and distraction biases, we uncover four key findings: (1) despite their advanced reasoning capabilities, LRMs remain susceptible to the above biases; (2) LRMs demonstrate better robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit notable position bias, preferring options in later positions; and (4) we identify a novel "superficial reflection bias" where phrases mimicking reasoning (e.g., "wait, let me think...") significantly influence model judgments. To address these biases, we design and evaluate three mitigation strategies: specialized system prompts that reduce judging biases by up to 19% in preference alignment datasets and 14% in fact-related datasets, in-context learning that provides up to 27% improvement on preference tasks but shows inconsistent results on factual tasks, and a self-reflection mechanism that reduces biases by up to 10% in preference datasets and 16% in fact-related datasets, with self-reflection proving particularly effective for LRMs. Our work provides crucial insights for developing more reliable LLM-as-a-Judge frameworks, especially as LRMs become increasingly deployed as automated judges.
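As a rough illustration of how a superficial reflection cue could be probed, the sketch below appends a reasoning-like phrase to an option the judge did not originally choose and counts how often the verdict changes. It is a minimal sketch under stated assumptions, not the benchmark's actual protocol or metric: the names `Judge`, `inject_reflection_cue`, `flip_rate`, and `longest_answer_judge` are hypothetical, and the toy judge stands in for a real LLM or LRM call.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge interface: takes a question and candidate answers,
# returns the index of the answer it prefers.
Judge = Callable[[str, Sequence[str]], int]

# Cue taken from the example phrase quoted in the abstract.
REFLECTION_CUE = "wait, let me think..."


def inject_reflection_cue(
    options: Sequence[str], target: int, cue: str = REFLECTION_CUE
) -> List[str]:
    """Return a copy of `options` with the cue appended to options[target]."""
    return [f"{opt} {cue}" if i == target else opt for i, opt in enumerate(options)]


def flip_rate(judge: Judge, items: Sequence[Tuple[str, Sequence[str]]]) -> float:
    """Fraction of items whose verdict changes after the cue is injected.

    Assumes every item has at least two options. The cue is added to the
    option right after the baseline choice (wrapping around), so a flip
    means the judge abandoned its original verdict.
    """
    if not items:
        return 0.0
    flips = 0
    for question, options in items:
        baseline = judge(question, options)
        target = (baseline + 1) % len(options)
        biased_options = inject_reflection_cue(options, target)
        if judge(question, biased_options) != baseline:
            flips += 1
    return flips / len(items)


def longest_answer_judge(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a model: prefers the longest answer (first on ties)."""
    return max(range(len(options)), key=lambda i: len(options[i]))
```

In practice `judge` would wrap a call to the model under evaluation. A lower flip rate would suggest more robustness to the cue, though the quantity computed here is only an assumed proxy for the biases described above.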
