Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM

Jingcong Liang, Rong Ye, Meng Han, Ruofei Lai, Xinyu Zhang, Xuanjing Huang, Zhongyu Wei

TL;DR

Debatrix tackles automated judgment of long, multi-turn debates by combining iterative chronological analysis with multi-dimensional evaluation guided by memory-augmented LLMs. It introduces PanelBench, a benchmark comprising DebateArt and BP-Competition to evaluate debate-judging systems against real outcomes. The framework uses memory for context and analysis, plus specialized judges to generate per-speech, per-debater, and final verdicts, with dimensional collaboration producing a cohesive assessment. Results show that Debatrix outperforms direct LLM prompting and baseline variants, particularly by addressing long-context challenges and leveraging dimension-specific analyses, though issues like position bias and ethical considerations remain areas for further work.

Abstract

How can we construct an automated debate judge to evaluate an extensive, vibrant, multi-turn debate? This task is challenging, as judging a debate involves grappling with lengthy texts, intricate argument relationships, and multi-dimensional assessments. At the same time, current research mainly focuses on short dialogues, rarely touching upon the evaluation of an entire debate. In this paper, by leveraging Large Language Models (LLMs), we propose Debatrix, which makes the analysis and assessment of multi-turn debates more aligned with majority preferences. Specifically, Debatrix features a vertical, iterative chronological analysis and a horizontal, multi-dimensional evaluation collaboration. To align with real-world debate scenarios, we introduced the PanelBench benchmark, comparing our system's performance to actual debate outcomes. The findings indicate a notable enhancement over directly using LLMs for debate evaluation. Source code and benchmark data are available online at https://github.com/ljcleo/debatrix .

Paper Structure (36 sections, 5 figures, 19 tables)

Figures (5)

  • Figure 1: An LLM debate judge judging a debate between Alice and Bob. The LLM needs to understand the arguments and how they counter each other (purple bubble); it also needs to evaluate the speeches along multiple dimensions (orange bubble). However, multi-round debates are often long, which can distract the model's attention or exceed its context window (light gray bubble).
  • Figure 2: General structure of Debatrix. ①: add speech to context memory; ②: retrieve relevant pieces of context and analysis; ③: add analysis and reflections to analysis memory; ④: fetch analysis for final judgment. The framework can generate speech/debater judgments and the winner verdict based on analysis from single or multiple dimensions.
  • Figure 3: A more detailed view of the iterative chronological analysis. "JudgeGen." is shorthand for "judge/verdict generator", which produces judgments and verdicts. Blocks and numbers match those in Figure 2, except for the debate analyzer, which is not shown in Figure 2.
  • Figure 4: Combining multiple dimensional debate analyses into one systematic analysis at the end of the debate. Each dimension possesses a memory for dimensional content analysis, allowing a more nuanced understanding of the debate.
  • Figure 5: Winner prediction RMSE on DebateArt debates with different numbers of speeches (left) or tokens (right). Note that Debatrix (light blue bar) always reaches a relatively low RMSE regardless of the number of speeches or tokens; no other model can achieve this.
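
The four numbered steps in the Figure 2 caption (add speech to context memory, retrieve context and analysis, store the new analysis, fetch analysis for the final judgment) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `DebateJudge`, its methods `observe` and `verdict`, and the stand-in functions `toy_analyze` and `toy_decide` are hypothetical names introduced here, the retrieval step naively returns everything seen so far, and the two callables take the place of the LLM calls a real judge would make.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Speech:
    debater: str
    text: str


@dataclass
class DebateJudge:
    """Hypothetical sketch of an iterative chronological judge (not the paper's code)."""

    # Stand-in for an LLM call: (speech, earlier speeches, earlier analyses) -> analysis
    analyze: Callable[[Speech, List[Speech], List[str]], str]
    # Stand-in for an LLM call: (all analyses) -> final verdict
    decide: Callable[[List[str]], str]
    context_memory: List[Speech] = field(default_factory=list)
    analysis_memory: List[str] = field(default_factory=list)

    def observe(self, speech: Speech) -> str:
        # Step 1: add the new speech to context memory.
        self.context_memory.append(speech)
        # Step 2: retrieve earlier context and analysis.
        # Assumption: naive retrieval that returns everything seen so far.
        earlier_speeches = self.context_memory[:-1]
        earlier_analyses = list(self.analysis_memory)
        analysis = self.analyze(speech, earlier_speeches, earlier_analyses)
        # Step 3: add the new analysis to analysis memory.
        self.analysis_memory.append(analysis)
        return analysis

    def verdict(self) -> str:
        # Step 4: fetch the stored analyses for the final judgment.
        return self.decide(list(self.analysis_memory))


def toy_analyze(speech: Speech, earlier_speeches: List[Speech],
                earlier_analyses: List[str]) -> str:
    # Placeholder only: a real analyzer would prompt an LLM here.
    return f"{speech.debater} spoke after {len(earlier_speeches)} earlier speech(es)"


def toy_decide(analyses: List[str]) -> str:
    # Placeholder only: a real verdict generator would prompt an LLM here.
    return f"verdict based on {len(analyses)} speech analyses"


if __name__ == "__main__":
    judge = DebateJudge(analyze=toy_analyze, decide=toy_decide)
    for speech in [Speech("Alice", "We should adopt X."),
                   Speech("Bob", "X fails because of Y.")]:
        print(judge.observe(speech))
    print(judge.verdict())
```

Because each speech is processed as it arrives, the analyzer only ever sees one new speech plus whatever is retrieved from memory, which is the property the framework relies on for debates too long to judge in a single prompt. A multi-dimensional variant could, under the same assumptions, keep one such judge per dimension and combine their analysis memories at the end.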