Mitigating the Bias of Large Language Model Evaluation

Hongli Zhou, Hui Huang, Yunfei Long, Bing Xu, Conghui Zhu, Hailong Cao, Muyun Yang, Tiejun Zhao

TL;DR

This work presents a systematic study of bias in LLM-as-a-Judge and proposes to mitigate it through contrastive training on curated negative samples that deviate from the instruction but exhibit better superficial quality.

Abstract

Recently, it has become common to evaluate Large Language Model (LLM) quality in the LLM-as-a-Judge style, namely by using another LLM to assess the quality of the current output. However, existing judges have been shown to be biased: they favor answers with better superficial quality (such as verbosity and fluency) while overlooking instruction-following ability. In this work, we present a systematic study of the bias of LLM-as-a-Judge. Specifically, for closed-source judge models, we apply calibration to reduce the influence of superficial quality, at both the probability level and the prompt level. For open-source judge models, we propose to mitigate the bias by contrastive training, with curated negative samples that deviate from the instruction but present better superficial quality. We apply our methods to the bias evaluation benchmark, and experimental results show that they mitigate the bias by a large margin while maintaining satisfactory evaluation accuracy.

Paper Structure

This paper contains 14 sections, 6 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: The three prompts for modeling superficial quality. The text in the blue blocks is common to all prompts; combined with the text in each dotted block, it forms the three types of prompts.
  • Figure 2: An example of online mitigation by calibration for generation-based evaluation. The correct output is that answer 1 is better. Subtracting the superficial quality effectively mitigates the bias of LLM evaluation.
  • Figure 3: The pipeline of our offline mitigation. The first step is to construct negative samples in the original dataset D, and the second step is to perform contrastive training with the original data and the newly constructed data.
  • Figure 4: The variation of accuracy with respect to different coefficients. Left: the result of Text-davinci-003 with probability calibration. Right: the result of GPT-4 with prompt calibration.
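
The captions for Figures 2 and 4 describe calibration as subtracting a superficial-quality term, weighted by a coefficient, from the judge's raw assessment. The sketch below is only an illustration of that idea under stated assumptions, not the paper's formulation: the function name, the numeric score scale, the linear form of the subtraction, and the default coefficient `alpha` are all hypothetical, and the paper's probability-level and prompt-level variants are not specified in this excerpt.

```python
def calibrated_preference(score_1, score_2, superficial_1, superficial_2, alpha=0.5):
    """Pick the preferred answer (1 or 2) after discounting superficial quality.

    score_*       : raw judge score for each answer (hypothetical scale)
    superficial_* : separately estimated superficial quality, e.g. fluency
                    or verbosity (hypothetical scale)
    alpha         : calibration coefficient (assumed; alpha=0 disables calibration)
    """
    # Subtract the weighted superficial-quality estimate from each raw score.
    adjusted_1 = score_1 - alpha * superficial_1
    adjusted_2 = score_2 - alpha * superficial_2
    # Ties go to answer 1 in this sketch.
    return 1 if adjusted_1 >= adjusted_2 else 2


# Illustrative numbers only: answer 1 follows the instruction but is terse,
# answer 2 is polished and verbose but drifts from the instruction.
raw_choice = calibrated_preference(6.0, 7.0, 3.0, 8.0, alpha=0.0)
calibrated_choice = calibrated_preference(6.0, 7.0, 3.0, 8.0, alpha=0.5)
print(raw_choice, calibrated_choice)
```

With these made-up inputs, the uncalibrated comparison favors answer 2, while the calibrated one favors answer 1. Sweeping `alpha` over a range of values mirrors the kind of coefficient study that Figure 4 reports.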