Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector

Haoyan Yang, Runxue Bao, Cao Xiao, Jun Ma, Parminder Bhatia, Shangqian Gao, Taha Kass-Hout

TL;DR

This work introduces the Reasoning-based Bias Detector (RBD), a plug-in module that mitigates bias in LLM-as-a-Judge evaluations by generating structured reasoning to guide evaluator self-correction. RBD stays external to the evaluator and iteratively detects four representative bias types (verbosity, position, bandwagon, and sentiment). It is fine-tuned on a distilled reasoning corpus so that it produces bias-aware reasoning traces. Across four RBD sizes (1.5B–14B) and eight evaluators, RBD consistently improves evaluation accuracy and consistency (for example, RBD-8B improves accuracy by 18.5% and consistency by 10.9% on average), and it generalizes across biases, domains, and external benchmarks with modest latency and cost. Reasoning-based supervision outperforms bias-label-only approaches and is more robust to prompt variations and multi-bias interactions, which makes RBD a scalable option for trustworthy LLM evaluation in both open- and closed-source settings. RBD thus offers a practical, interpretable enhancement to LLM-based judging.

Abstract

LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator's limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types--verbosity, position, bandwagon, and sentiment--evaluated using 8 LLM evaluators demonstrate RBD's strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD's effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.
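
The iterative detect-and-revise process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `evaluator` and `detector` callables, the `BiasReport` type, and the `max_rounds` cap are all hypothetical names introduced here for illustration. The sketch only shows the control flow of an external detector feeding its reasoning back to an unmodified evaluator.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BiasReport:
    """Hypothetical container for the detector's output."""
    biased: bool     # did the detector flag the judgment as biased?
    reasoning: str   # bias analysis handed back to the evaluator


# Hypothetical interfaces (assumptions, not from the paper):
#   evaluator(prompt, feedback) -> judgment label
#   detector(prompt, judgment)  -> BiasReport
Evaluator = Callable[[str, Optional[str]], str]
Detector = Callable[[str, str], BiasReport]


def judge_with_rbd(
    prompt: str,
    evaluator: Evaluator,
    detector: Detector,
    max_rounds: int = 3,  # assumed cap; the abstract only says "iterative"
) -> str:
    """Run the evaluator, then let an external detector request revisions."""
    judgment = evaluator(prompt, None)  # initial, possibly biased judgment
    for _ in range(max_rounds):
        report = detector(prompt, judgment)
        if not report.biased:
            break  # no bias detected: keep the current judgment
        # Bias detected: the evaluator reconsiders given the reasoning.
        judgment = evaluator(prompt, report.reasoning)
    return judgment
```

Because the evaluator is only ever called through its normal input interface, the same loop applies to closed-source evaluators that cannot be fine-tuned.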

Paper Structure

This paper contains 58 sections, 4 equations, 22 figures, 17 tables, 1 algorithm.

Figures (22)

  • Figure 1: Overview of the Reasoning-based Bias Detector (RBD) framework. During RBD inference, it examines biased evaluation results produced by an LLM-as-a-Judge. If bias is identified, RBD generates a reasoning-based bias analysis to guide the LLM in reflecting on and potentially revising its evaluation; otherwise, the original judgment remains unchanged. To train RBD, we design a data collection and distilled reasoning-based training pipeline. We first construct a biased dataset containing specific types of bias and collect possibly biased evaluation results from the LLM evaluator. Then, a teacher Language Reasoning Model (LRM) produces bias analysis thinking based on the evaluation context. These analyses are filtered and used to fine-tune a base LRM into the final RBD model capable of identifying and correcting evaluation bias.
  • Figure 1: Base datasets used to construct the original and biased datasets. GSM8K is a math QA dataset with reasoning and final answers; Arena contains AI-generated chat instruction pairs; and ScienceQA includes multimodal multiple-choice science questions.
  • Figure 2: Overview of the bias dataset construction, illustrating how we create the specific biased dataset for each bias (Verbosity, Position, Bandwagon, Sentiment).
  • Figure 3: Performance comparison across four types of bias in $\mathcal{D}$ and $\mathcal{D}_\text{bias}$. (a)–(d) show accuracy and consistency drops for each bias type. (e) summarizes the percentage of biased examples.
  • Figure 4: Comparison of reasoning-based and label-only fine-tuning on the original test set and two diagnostic sets.
  • ...and 17 more figures
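
The Figure 3 caption compares accuracy and consistency between $\mathcal{D}$ and $\mathcal{D}_\text{bias}$. The sketch below shows one plausible way to compute such quantities. The definitions are assumptions made for illustration, since this page does not state the paper's exact formulas: accuracy is taken as the fraction of judgments that match the gold label, and consistency as the fraction of examples on which the evaluator gives the same judgment for the original and the biased version.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of judgments that match the gold label (assumed definition)."""
    if len(preds) != len(gold) or not preds:
        raise ValueError("preds and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(preds, gold)) / len(preds)


def consistency(preds_orig: Sequence[str], preds_bias: Sequence[str]) -> float:
    """Fraction of examples judged identically on D and D_bias (assumed definition)."""
    if len(preds_orig) != len(preds_bias) or not preds_orig:
        raise ValueError("inputs must be non-empty and equally long")
    return sum(a == b for a, b in zip(preds_orig, preds_bias)) / len(preds_orig)


# Toy data, purely illustrative (not results from the paper).
gold = ["A", "B", "B", "B"]
on_original = ["A", "B", "A", "B"]   # judgments on D
on_biased = ["A", "A", "A", "B"]     # judgments on D_bias

acc_drop = accuracy(on_original, gold) - accuracy(on_biased, gold)
print(f"accuracy drop: {acc_drop:.2f}")
print(f"consistency:   {consistency(on_original, on_biased):.2f}")
```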