An Empirical Analysis on Large Language Models in Debate Evaluation

Xinyi Liu; Pinxin Liu; Hangfeng He

TL;DR

The paper studies automated debate evaluation by empirically analyzing large language models, notably GPT-3.5 and GPT-4, using a prompt-based evaluation template with varied label verbalizers to measure both performance and bias. It reports that these LLMs can match or exceed human evaluators and surpass state-of-the-art fine-tuned baselines, while also exhibiting biases tied to prompt design, including positional, lexical, and end-of-discussion effects. The study examines how label choices and the order of candidate answers influence judgments, and it provides a framework of balanced and unbalanced data settings to separate model biases from dataset biases. The findings underscore the importance of careful prompt engineering and bias mitigation for reliable, fair automated debate evaluation, with practical implications for deploying LLM-based evaluators in research and applied settings.

Abstract

In this study, we investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation. We find that LLMs' performance in debate evaluation exceeds that of humans and surpasses that of state-of-the-art methods fine-tuned on extensive datasets. We additionally explore and analyze biases present in LLMs, including positional bias, lexical bias, and order bias, which may affect their evaluative judgments. Our findings reveal a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented, which we attribute to prompt design. We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry numerical or sequential connotations, highlighting the critical need for careful label verbalizer selection in prompt design. Additionally, our analysis indicates a tendency of both models to favor the debate's concluding side as the winner, suggesting an end-of-discussion bias.
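The positional bias described above can be quantified as a change in how often the judge picks a side when only that side's position in the prompt changes. The following is a minimal sketch under stated assumptions, not the authors' code: the function names `predicted_con_rate` and `positional_shift`, the `"pro"`/`"con"` label strings, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable


def predicted_con_rate(predictions: Iterable[str]) -> float:
    """Fraction of debates the judge assigned to the Con side."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions given")
    return counts["con"] / total


def positional_shift(con_first: Iterable[str], con_second: Iterable[str]) -> float:
    """Change in the Predicted Con rate when Con moves from the first
    candidate slot to the second. A positive value means the judge
    favors whichever side is listed second."""
    return predicted_con_rate(con_second) - predicted_con_rate(con_first)


# Hypothetical judge outputs for the same four debates under two prompt layouts.
con_first = ["pro", "pro", "con", "pro"]   # Con listed as the first candidate
con_second = ["con", "pro", "con", "pro"]  # Con listed as the second candidate

shift = positional_shift(con_first, con_second)
print(f"{shift:+.2f}")  # prints "+0.25"
```

The same difference-of-rates comparison could plausibly be reused for lexical bias by holding positions fixed and swapping the verbalizers assigned to Pro and Con, which is how the figure captions describe the comparison.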

Paper Structure

This paper contains 36 sections, 10 figures, 18 tables.

Introduction
Methodology
LLMs' capability for debate evaluation.
LLMs' biases in debate evaluation.
Experiments
Dataset.
Evaluation metrics.
Models.
Human annotation.
Results and Analysis
LLMs’ Performance
Biases Analysis
Positional Bias.
Lexical Bias.
Order Bias.
...and 21 more sections

Figures (10)

Figure 1: Large language models exhibit various biases when evaluating long debates.
Figure 2: The observed positional bias in GPT-3.5 is evident through the alteration in the proportion of Predicted Con outcomes, which increases when Con is positioned as the second candidate response compared to its placement as the first. This consistent preference across all label configurations suggests a systematic positional bias favoring the second candidate, underscoring the model's sensitivity to the order in which options are presented.
Figure 3: Each subfigure's legend delineates the Pro/Con label set across different verbalizer configurations. GPT-3.5 demonstrates a consistent lexical bias, which persists across shuffled positions aimed at counteracting positional bias, and in settings where Pro consistently precedes Con, except for the insignificant bias within P/C.
Figure 4: This figure illustrates the impact of positional bias on GPT-4 through the changes in the proportion of Predicted Con, shifting from when Con is fixed as the first candidate response to when it is positioned as the second. GPT-4 exhibits a positional bias towards the second candidate presented across all label set configurations.
Figure 5: This figure illustrates the impact of lexical bias on GPT-4 through the changes in the proportion of Predicted Con that result from switching the verbalizers for Pro and Con.
...and 5 more figures
