Evaluating the Performance of Large Language Models via Debates

Behrad Moniri; Hamed Hassani; Edgar Dobriban

TL;DR

The paper introduces an automated, debate-based benchmarking framework to evaluate and rank large language models. It formalizes a multi-round debate on predefined topics, with a judge LLM assessing arguments and determining winners, thereby capturing domain knowledge, reasoning, and inconsistency detection while avoiding costly human crowdsourcing. Empirical results show rankings that align with human-based benchmarks and existing leaderboards, and robustness checks indicate that different judge models yield similar model hierarchies. The approach demonstrates scalable, domain-agnostic model evaluation, while acknowledging limitations related to topic selection, judge strength, and language scope.

Abstract

Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either based on fixed, domain-specific questions that lack the flexibility required in many real-world applications, or rely on human input, making them unscalable. To address these issues, we propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge, but also skills such as argumentative reasoning and inconsistency recognition. We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input, eliminating the need for costly human crowdsourcing.
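The debate procedure summarized above (two models argue over a list of topics, a judge model names a winner per topic, and the model with the most topic wins is the overall winner) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Debater` and `Judge` callables, the `"A"`/`"B"` verdict labels, the default of three rounds, and the tie handling are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
# a debater maps (topic, transcript so far) to its next argument;
# a judge maps (topic, full transcript) to the label "A" or "B".
Debater = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str]], str]


def run_debate(topic: str, model_a: Debater, model_b: Debater,
               judge: Judge, rounds: int = 3) -> str:
    """Run one multi-round debate on a topic and return the judge's verdict."""
    transcript: List[str] = []
    for _ in range(rounds):
        # Each model sees the transcript so far before responding.
        transcript.append("A: " + model_a(topic, transcript))
        transcript.append("B: " + model_b(topic, transcript))
    verdict = judge(topic, transcript)
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict


def overall_winner(topics: List[str], model_a: Debater, model_b: Debater,
                   judge: Judge, rounds: int = 3) -> str:
    """Declare the model with the most topic wins the overall winner."""
    wins = Counter(run_debate(t, model_a, model_b, judge, rounds) for t in topics)
    if wins["A"] == wins["B"]:
        return "tie"  # tie handling is an assumption of this sketch
    return "A" if wins["A"] > wins["B"] else "B"
```

In practice each callable would wrap a call to an LLM API; here they are left abstract so the control flow is visible on its own.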

Paper Structure (75 sections, 3 figures, 2 tables)

Introduction
Evaluation via Debates
Related Works
Debate Framework
Algorithm Details
The Choice of the Judge LLM
Experimental Results
Rankings
Other Experiments
Analysis of Content
Human as Judge
Llama-3 as Judge
Conclusion
Limitations
List of Topics
...and 60 more sections

Figures (3)

Figure 1: A snippet of debates. Two language models engage in debates on a list of topics, and a judge model announces the winner for each topic. The language model with the most wins across all topics is declared the overall winner.
Figure 2: Overall ranking of LLMs with GPT-4 as judge.
Figure 3: The percentage of times each model won against another model for each of the six reasons.
