Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector
Haoyan Yang, Runxue Bao, Cao Xiao, Jun Ma, Parminder Bhatia, Shangqian Gao, Taha Kass-Hout
TL;DR
This work introduces the Reasoning-based Bias Detector (RBD), a plug-in module that mitigates bias in LLM-as-a-Judge evaluations by generating structured reasoning to guide evaluator self-correction. RBD remains external to the evaluator and iteratively detects bias of four representative types (verbosity, position, bandwagon, and sentiment). It is fine-tuned on a distilled reasoning corpus so that it produces bias-aware reasoning traces. Across four RBD sizes (1.5B to 14B) and eight evaluators, RBD consistently improves evaluation accuracy and consistency; the RBD-8B model, for example, improves accuracy by an average of 18.5% and consistency by 10.9%. RBD also generalizes across biases, domains, and external benchmarks with modest latency and cost. The results show that reasoning-based supervision outperforms bias-label-only approaches and is more robust to prompt variations and multi-bias interactions, making RBD a scalable approach to trustworthy LLM evaluation with both open- and closed-source evaluators. Overall, RBD is a practical, interpretable enhancement to LLM-based judging that enables more reliable automatic evaluation across diverse NLP tasks.
Abstract
LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address deeply rooted biases because the evaluator has limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types (verbosity, position, bandwagon, and sentiment), evaluated using 8 LLM evaluators, demonstrate RBD's effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD's effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.
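The iterative process of bias detection and feedback-driven revision described in the abstract can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `BiasReport` and `judge_with_detector`, the callable signatures, the `max_rounds` cap, and the toy stand-ins for the evaluator and detector are all hypothetical and chosen for illustration. In practice the two callables would wrap an LLM evaluator and a fine-tuned RBD model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class BiasReport:
    biased: bool    # detector's verdict on the evaluator's current judgment
    reasoning: str  # structured reasoning handed back to the evaluator


# Hypothetical interfaces (assumptions, not from the paper):
# an evaluator maps (prompt, optional feedback) to a chosen option,
# a detector maps (prompt, chosen option) to a BiasReport.
Evaluator = Callable[[str, Optional[str]], str]
Detector = Callable[[str, str], BiasReport]


def judge_with_detector(
    prompt: str,
    evaluator: Evaluator,
    detector: Detector,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Return the final choice and the number of revisions that were requested."""
    choice = evaluator(prompt, None)  # initial judgment, no feedback yet
    revisions = 0
    for _ in range(max_rounds):
        report = detector(prompt, choice)
        if not report.biased:
            break  # the detector accepts the judgment, so stop iterating
        revisions += 1
        # The evaluator itself is never modified; it only sees the reasoning.
        choice = evaluator(prompt, report.reasoning)
    return choice, revisions


# Toy stand-ins: the evaluator prefers "B" until it receives any feedback,
# and the detector flags "B" as a verbosity-biased choice.
def toy_evaluator(prompt: str, feedback: Optional[str]) -> str:
    return "A" if feedback else "B"


def toy_detector(prompt: str, choice: str) -> BiasReport:
    if choice == "B":
        return BiasReport(True, "Option B is longer but not more correct.")
    return BiasReport(False, "No bias detected.")


final_choice, revisions = judge_with_detector(
    "Which answer is better, A or B?", toy_evaluator, toy_detector
)
print(final_choice, revisions)
```

The loop keeps the detector external to the evaluator, which is what would allow the same pattern to wrap a closed-source judge behind an API. The round cap is an added safeguard in this sketch so that a persistent disagreement between the two models cannot loop forever.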
