Multi-Agent Debate Strategies to Enhance Requirements Engineering with Large Language Models
Marc Oriol, Quim Motger, Jordi Marco, Xavier Franch
TL;DR
This work investigates using Multi-Agent Debate (MAD) to improve the accuracy of LLM-based AI agents in Requirements Engineering (RE). It systematically maps MAD strategies across domains to build a taxonomy of debate characteristics (participants, interaction, and agreement) and assesses applicability to RE through a three-agent MAD (two debaters and a judge) for binary RE classification on the PROMISE dataset. Results show MAD can substantially boost accuracy and F1, especially without inter-debater interaction (n=0), but at a significantly higher cost in tokens, time, and money; statistical tests indicate the gains are unlikely to be due to chance. The study provides a foundational framework and a plan for expanding MAD evaluations to more RE tasks, additional datasets, and diverse LLMs, with attention to trust, fairness, and scalability.
Abstract
Context: Large Language Model (LLM) agents are becoming widely used for various Requirements Engineering (RE) tasks. Research on improving their accuracy mainly focuses on prompt engineering, model fine-tuning, and retrieval-augmented generation. However, these methods often treat models as isolated black boxes, relying on single-pass outputs without iterative refinement or collaboration, which limits robustness and adaptability. Objective: We propose that, just as human debates enhance accuracy and reduce bias in RE tasks by incorporating diverse perspectives, different LLM agents debating and collaborating may achieve similar improvements. Our goal is to investigate whether Multi-Agent Debate (MAD) strategies can enhance RE performance. Method: We conducted a systematic study of existing MAD strategies across various domains to identify their key characteristics. To assess their applicability in RE, we implemented and tested a preliminary MAD-based framework for RE classification. Results: Our study identified and categorized several MAD strategies, leading to a taxonomy outlining their core attributes. Our preliminary evaluation demonstrated the feasibility of applying MAD to RE classification. Conclusions: MAD presents a promising approach for improving LLM accuracy in RE tasks. This study provides a foundational understanding of MAD strategies, offering insights for future research and refinements in RE applications.
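To make the three-agent setup concrete, the following is a minimal sketch of a debate loop with two debaters and a judge, where `n_rounds` plays the role of the number of inter-debater interaction rounds (n=0 means the judge sees only the two independent answers). It is not the authors' implementation: the function names, the prompt wording, the choice of "functional or not" as the binary question, and the toy stand-in agents are all assumptions made for illustration. In practice each agent would wrap a call to an LLM.

```python
from typing import Callable, List, Tuple

# An agent is any callable that maps a prompt to a response.
# In a real system this would wrap an LLM call (assumption).
Agent = Callable[[str], str]


def debate(requirement: str, debater_a: Agent, debater_b: Agent,
           judge: Agent, n_rounds: int = 0) -> Tuple[str, List[Tuple[str, str]]]:
    """Run a two-debater, one-judge debate; return (label, transcript)."""
    # Hypothetical prompt; the binary question is an illustrative assumption.
    question = ("Is this requirement functional? Answer yes or no.\n"
                f"Requirement: {requirement}")

    # Step 1: each debater answers independently.
    answer_a = debater_a(question)
    answer_b = debater_b(question)
    transcript = [("A", answer_a), ("B", answer_b)]

    # Step 2: n_rounds of interaction; each debater sees the other's last answer.
    for _ in range(n_rounds):
        new_a = debater_a(f"{question}\nThe other debater said: {answer_b}\n"
                          "Revise or defend your answer.")
        new_b = debater_b(f"{question}\nThe other debater said: {answer_a}\n"
                          "Revise or defend your answer.")
        answer_a, answer_b = new_a, new_b
        transcript += [("A", answer_a), ("B", answer_b)]

    # Step 3: the judge reads the whole transcript and gives the final label.
    history = "\n".join(f"Debater {name}: {text}" for name, text in transcript)
    label = judge(f"{question}\n{history}\nGive the final answer (yes or no).")
    return label, transcript


# Toy stand-in agents so the sketch runs without any LLM (hypothetical).
def always_yes(prompt: str) -> str:
    return "yes"


def always_no(prompt: str) -> str:
    return "no"


def tally_judge(prompt: str) -> str:
    # Counts debater votes in the transcript; ties resolve to "no".
    return "yes" if prompt.count(": yes") > prompt.count(": no") else "no"


if __name__ == "__main__":
    label, transcript = debate("The system shall export reports as PDF.",
                               always_yes, always_no, tally_judge, n_rounds=0)
    print(label, len(transcript))
```

Each interaction round adds two more model calls and lengthens the judge's prompt, which is one way to see why debate configurations cost more tokens, time, and money than a single-pass classifier.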
