Explaining Length Bias in LLM-Based Preference Evaluations

Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, Hui Xiong

TL;DR

The paper addresses length bias in LLM-based preference evaluations by decomposing win rate into two components: desirability (length-independent, tied to trustworthiness) and information mass (length-dependent, the amount of content in a response). It introduces a theoretical framework, the AdapAlpaca method, which matches reference and test response lengths so that content quality can be compared fairly, and a Quality Enhancement prompt intended to raise both desirability and information mass. Empirical results show that desirability and information mass each influence win rate in predictable ways, that AdapAlpaca yields more human-aligned evaluations than standard AlpacaEval, and that the gains from DPO are partly attributable to response length. The work offers a practical route to benchmarking RLHF/DPO systems without length confounding and highlights RLHF-induced length bias as a key concern for scalable LLM evaluation.

Abstract

The use of large language models (LLMs) as judges, particularly in preference comparisons, has become widespread, but this practice reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand this bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. The former is length-independent and related to trustworthiness, such as correctness, toxicity, and consistency; the latter is length-dependent and represents the amount of information in the response. We empirically demonstrate the decomposition through controlled experiments and find that response length impacts evaluations by influencing information mass. To derive a reliable evaluation metric that assesses content quality without being confounded by response length, we propose AdapAlpaca, a simple yet effective adjustment to win rate measurement. Specifically, AdapAlpaca ensures a fair comparison of response quality by aligning the lengths of reference and test model responses under equivalent length intervals.
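
The length-alignment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the interval edges, the function names (`length_interval`, `pick_reference`, `length_matched_win_rate`), the per-prompt reference pool, and the `judge` callable are all hypothetical, and word count stands in for response length. The sketch assigns each response to a word-count interval, compares a test response only against a reference from the same interval, and computes the win rate over the prompts where such a reference exists.

```python
from bisect import bisect_right
from typing import Callable, Dict, List, Optional

# Hypothetical interval edges (in words); the source does not state the actual intervals.
INTERVAL_EDGES = [200, 400, 600, 800]


def length_interval(text: str, edges: List[int] = INTERVAL_EDGES) -> int:
    """Return the index of the word-count interval that `text` falls into."""
    return bisect_right(edges, len(text.split()))


def pick_reference(test_response: str,
                   references: List[str],
                   edges: List[int] = INTERVAL_EDGES) -> Optional[str]:
    """Return the first reference in the same length interval as the test response."""
    target = length_interval(test_response, edges)
    for ref in references:
        if length_interval(ref, edges) == target:
            return ref
    return None  # no length-matched reference available for this prompt


def length_matched_win_rate(test_responses: Dict[str, str],
                            reference_pool: Dict[str, List[str]],
                            judge: Callable[[str, str, str], bool],
                            edges: List[int] = INTERVAL_EDGES) -> float:
    """Win rate of the test model against length-matched references.

    `judge(prompt, test, reference)` is an assumed callable that returns True
    when the test response is preferred (for example, a wrapper around an LLM judge).
    Prompts without a length-matched reference are skipped.
    """
    wins = 0
    compared = 0
    for prompt, test in test_responses.items():
        ref = pick_reference(test, reference_pool.get(prompt, []), edges)
        if ref is None:
            continue
        compared += 1
        wins += int(judge(prompt, test, ref))
    return wins / compared if compared else float("nan")
```

In this sketch the judge never sees a pair whose lengths fall in different intervals, so a longer test response cannot win simply by being compared against a much shorter reference. How prompts without a matched reference should be handled is a design choice the sketch leaves open by skipping them.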

Paper Structure (53 sections, 3 equations, 20 figures, 16 tables)

Figures (20)

  • Figure 1: Comparison between AlpacaEval and AdapAlpaca (Ours). In AlpacaEval, the reference answer has a fixed length, regardless of the length of the test model's answer. In contrast, AdapAlpaca dynamically selects a reference answer that matches the length of the test model's answer.
  • Figure 2: Validation of desirability's impact on quality for GPT-4. The results demonstrate that desirability influences the win rate.
  • Figure 3: Validation of information mass's impact on quality for GPT-4. The results demonstrate that information mass influences the win rate.
  • Figure 4: Correlation between information mass and word count for responses of GPT-4. As the word count increases, the information mass also increases.
  • Figure 5: Case study on comparing GPT-4 and human vote on AlpacaEval and AdapAlpaca. In AlpacaEval, GPT-4 votes for the verbose answer, but humans vote for the concise reference answer, while in AdapAlpaca, GPT-4 and humans vote for the same answer, demonstrating a better LLM-human alignment on AdapAlpaca.
  • ...and 15 more figures

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3