Can LLMs Beat Humans in Debating? A Dynamic Multi-agent Framework for Competitive Debate

Yiqun Zhang, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song

TL;DR

LLMs hallucinate and lack competitiveness in complex, multi-turn debates. The paper introduces Agent4Debate, a dynamic four-agent framework (Searcher, Analyzer, Writer, Reviewer) that coordinates across three debate stages to improve factuality and argumentative quality. It also builds the Competitive Debate Arena, a public resource of 66 Chinese debate motions evaluated with Debatrix-Elo and Human-Elo rankings, and reports ablation studies confirming each component's value. Results show that Agent4Debate can reach human-level performance in Chinese competitive debates, with stronger gains for more capable foundation models and clear potential for multilingual deployment.

Abstract

Competitive debate is a complex task of computational argumentation. Large Language Models (LLMs) suffer from hallucinations and lack competitiveness in this field. To address these challenges, we introduce Agent for Debate (Agent4Debate), a dynamic multi-agent framework based on LLMs designed to enhance their capabilities in competitive debate. Drawing inspiration from human behavior in debate preparation and execution, Agent4Debate employs a collaborative architecture in which four specialized agents, namely Searcher, Analyzer, Writer, and Reviewer, dynamically interact and cooperate. These agents work throughout the debate process, covering multiple stages from initial research and argument formulation to rebuttal and summary. To comprehensively evaluate framework performance, we construct the Competitive Debate Arena, comprising 66 carefully selected Chinese debate motions. We recruit ten experienced human debaters and collect records of 200 debates involving Agent4Debate, baseline models, and humans. The evaluation employs the Debatrix automatic scoring system and professional human reviewers, based on the established Debatrix-Elo and Human-Elo rankings. Experimental results indicate that the state-of-the-art Agent4Debate exhibits capabilities comparable to those of humans. Furthermore, ablation studies demonstrate the effectiveness of each component in the agent structure.
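
The collaboration described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `StageState` container, the stub agent functions, the fixed Searcher → Analyzer → Writer → Reviewer order, the approval rule, and the `max_rounds` limit are all hypothetical, chosen only to illustrate one stage being refined iteratively until a reviewer accepts the draft. The actual framework is described as dynamic, so the real interaction order may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StageState:
    """Hypothetical shared state passed between the four agents."""
    motion: str
    stage: str
    evidence: List[str] = field(default_factory=list)
    outline: str = ""
    draft: str = ""
    feedback: str = ""
    approved: bool = False


def searcher(state: StageState) -> StageState:
    # Stub: a real Searcher would call a retrieval tool or an LLM.
    round_no = len(state.evidence) + 1
    state.evidence.append(f"evidence for '{state.motion}' (round {round_no})")
    return state


def analyzer(state: StageState) -> StageState:
    # Stub: summarise the gathered evidence into an argument outline.
    state.outline = f"{state.stage} outline from {len(state.evidence)} evidence item(s)"
    return state


def writer(state: StageState) -> StageState:
    # Stub: turn the outline into a speech draft.
    state.draft = f"Draft based on: {state.outline}"
    return state


def reviewer(state: StageState) -> StageState:
    # Stub rule (an assumption): approve once two evidence items exist.
    state.approved = len(state.evidence) >= 2
    state.feedback = "" if state.approved else "needs more evidence"
    return state


def run_stage(motion: str, stage: str, max_rounds: int = 3) -> StageState:
    """Run the agents in a loop until the reviewer approves or rounds run out."""
    state = StageState(motion=motion, stage=stage)
    for _ in range(max_rounds):
        for agent in (searcher, analyzer, writer, reviewer):
            state = agent(state)
        if state.approved:
            break
    return state
```

Calling `run_stage("Justice is nothing but interest", "constructive")` runs two rounds under these stub rules, because the hypothetical reviewer rejects the first draft and sends the loop back to the Searcher. The stage label `"constructive"` is itself an assumed name.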

Paper Structure (26 sections, 5 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Before and After: Agent4Debate's impact on LLMs' competitive debating skills.
  • Figure 2: Agent for Debate (Agent4Debate) Workflow: A dynamic framework simulating human debate team collaboration. From searching to reviewing, it showcases how four key roles (Searcher, Analyzer, Writer, Reviewer) interact and work iteratively. The right side illustrates the cyclical process from information gathering to argument formation using Stage 1 as an example, highlighting the framework's multi-step progression and recursive refinement.
  • Figure 3: Predicted Win Rates Using Elo Rankings for Model A in A vs. B Battles.
  • Figure 4: English (translated by Claude-3.5-sonnet from Chinese) case study of the debate motion "Justice is nothing but interest. (Pro side) / Justice is nothing more than interest (Con side)". Pro side is Agent4Debate (GPT-4o), Con side is Agent4Debate (Claude-3.5-sonnet).
  • Figure 5: Chinese case study of the debate motion "Justice is nothing but interest. (Pro side) / Justice is nothing more than interest (Con side)". Pro side is Agent4Debate (GPT-4o), Con side is Agent4Debate (Claude-3.5-sonnet).
  • ...and 2 more figures
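
As background for Figure 3, the sketch below shows how a predicted win rate is conventionally derived from two Elo ratings. It uses the standard Elo expected-score formula with a base-10 logistic and a scale of 400. Whether the paper's Debatrix-Elo and Human-Elo rankings use exactly these constants, or the online update shown here, is not stated in this excerpt, so the `scale` and `k` defaults and both function names are assumptions.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score of A against B (scale=400 is the usual default)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is the update step size; 32 is a common choice, assumed here.
    """
    expected_a = expected_win_rate(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Under this formula, two equally rated debaters each have a predicted win rate of 0.5, and a 400-point advantage corresponds to a predicted win rate of about 0.91 for the stronger side.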