Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning

Sugyeong Eo, Hyeonseok Moon, Evelyn Hayoon Zi, Chanjun Park, Heuiseok Lim

TL;DR

The paper tackles the high computational cost and error-propagation risks of multiagent LLM debate systems. It introduces DOWN, an adaptive framework that triggers debate only when the initial response confidence is low, using a confidence-guided, two-round refinement process and either voting or judge-based finalization. Across MUSR and StrategyQA, DOWN achieves up to sixfold efficiency gains while maintaining or improving accuracy, and analysis shows it mitigates cascading errors and generalizes to mixed-model setups and broader domains. The findings demonstrate that selective, confidence-informed debate can deliver high-performance reasoning with substantially reduced resource consumption, offering a scalable alternative to full-debate approaches.

Abstract

Multiagent collaboration has emerged as a promising framework for enhancing the reasoning capabilities of large language models (LLMs). Despite improvements in reasoning, the approach introduces substantial computational overhead resulting from iterative agent interactions. Furthermore, engaging in unnecessary debates increases the risk of generating erroneous responses. To address these challenges, we propose Debate Only When Necessary (DOWN), an adaptive multiagent debate framework that selectively activates debate based on the confidence score of the agent's initial response. Debate is activated only for queries requiring further deliberation, during which agents refine their outputs by referencing peer responses and associated confidence scores. Evaluations on benchmarks show that DOWN improves efficiency by up to six times while matching or even surpassing the performance of existing methods. Further analysis indicates that DOWN effectively mitigates the risk of error propagation stemming from unnecessary debate. These findings demonstrate the effectiveness of our approach in delivering high-performance LLM solutions at a lower computational cost.

Paper Structure

This paper contains 30 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of accuracy and average agent calls across various multiagent debate methods
  • Figure 2: Overview of the Debate Only When Necessary (DOWN) framework. DOWN consists of four stages: (1) the initial agent generates a response, during which the model's confidence score is extracted; (2) if the confidence score exceeds a threshold value, the response is accepted without debate to improve efficiency; otherwise, a multiagent debate is activated; (3) agents refine their responses by referencing peer outputs and associated confidence scores; (4) the final answer is selected via majority voting or a designated judge agent.
  • Figure 3: Comparison of multiagent debate system performance in a mixed-model configuration. The configuration includes Llama3.3-70B, Qwen-2.5 72B, and GPT-4o-mini, with the model order randomized for each query. For single-model approaches, we report the results of GPT-4o-mini.
  • Figure 4: Accuracy and average agent calls (AC) of multiagent debate methods across six MMLU domains
  • Figure 5: Qualitative analysis of the MUSR dataset
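
The four stages listed in the Figure 2 caption can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the `Response` type, the `Agent` callable signature, the `down` function, the default `threshold` of 0.8, and the `refine_rounds` parameter are all hypothetical names and values chosen for the example. The sketch also assumes confidence is a number in [0, 1] and uses majority voting for the final stage (the judge-agent variant is not shown).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Response:
    answer: str
    confidence: float  # assumed to lie in [0, 1]


# Hypothetical agent interface: (query, peer responses) -> Response.
# A real agent would wrap an LLM call; here it is any callable.
Agent = Callable[[str, Sequence[Response]], Response]


def down(
    query: str,
    agents: Sequence[Agent],
    threshold: float = 0.8,   # illustrative value, not from the paper
    refine_rounds: int = 1,   # illustrative value, not from the paper
) -> Tuple[str, int]:
    """Return (final answer, number of agent calls made)."""
    # Stage 1: the initial agent answers and reports its confidence.
    first = agents[0](query, [])
    calls = 1

    # Stage 2: accept the answer without debate if confidence is high enough.
    if first.confidence > threshold:
        return first.answer, calls

    # Debate is activated: the remaining agents produce their own answers.
    responses = [first] + [agent(query, []) for agent in agents[1:]]
    calls += len(agents) - 1

    # Stage 3: each agent refines its answer given its peers' answers
    # and the confidence scores attached to them.
    for _ in range(refine_rounds):
        responses = [
            agent(query, [r for j, r in enumerate(responses) if j != i])
            for i, agent in enumerate(agents)
        ]
        calls += len(agents)

    # Stage 4: majority vote over the refined answers.
    votes = Counter(r.answer for r in responses)
    return votes.most_common(1)[0][0], calls
```

Returning the call count alongside the answer makes the efficiency trade-off visible: a query that clears the threshold costs a single agent call, while a debated query costs one call per agent for the initial answers plus one per agent for each refinement round.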