MARS: toward more efficient multi-agent collaboration for LLM reasoning

Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang

Abstract

Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.

Paper Structure

This paper contains 32 sections, 6 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the architecture of MARS and baselines. (a) Basic single model inference (top) and self-reflection (bottom). (b) Self-consistency. (c) Multi-Agent Debate (MAD). (d) Multi-Agent Review System (MARS). In MARS, the author agent receives a user query and generates an initial response. Each reviewer agent evaluates the response and provides a decision, confidence level, and justification (e.g., reasons for the decision, identified author mistakes). The meta-reviewer integrates review comments and makes the final decision, with suggestions for answer revision. Finally, the author agent incorporates the feedback and updates its response, leading to enhanced reasoning.
  • Figure 2: Accuracy-resource trade-off across different models on GPQA. MARS demonstrates a significant reduction in the average number of tokens compared to MAD, while achieving higher accuracy than self-consistency.
  • Figure 3: Comparison of MARS and MAD on GPQA with a varying number of agents. Row 1: using GPT-3.5-turbo as the backbone; Row 2: using GPT-4o-mini as the backbone. Column 1: accuracy scores; Column 2: average number of tokens.
  • Figure 4: Case study of MARS on a GSM example. Upon receiving the user query, the author agent generates an initial response but incorrectly recomputes a given variable, leading to an incorrect final answer. The reviewers identify the mistake and generate feedback, which guides the author agent to revise the solution, resulting in the correct answer.
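
The author, reviewer, and meta-reviewer flow described in the abstract and in Figure 1(d) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (the linked repository is the reference for that). The names `run_mars`, `Review`, `MetaDecision`, and the `toy_*` agents are hypothetical, the 1-5 confidence scale is assumed, and the confidence-weighted vote only stands in for whatever the meta-reviewer model actually does. The toy agents replace LLM calls so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Review:
    decision: str        # "accept" or "reject"
    confidence: int      # assumed 1-5 scale; the source does not specify one
    justification: str   # reasons for the decision, identified mistakes


@dataclass
class MetaDecision:
    decision: str        # "accept" or "revise"
    suggestions: str     # guidance passed back to the author


def run_mars(query, author, reviewers, meta_reviewer, max_revisions=1):
    """Author drafts, reviewers judge independently, meta-reviewer decides."""
    answer = author(query, None, "")
    for _ in range(max_revisions):
        # Each reviewer sees only the query and the answer, never another
        # review, so there is no reviewer-to-reviewer communication.
        reviews = [review(query, answer) for review in reviewers]
        meta = meta_reviewer(query, answer, reviews)
        if meta.decision == "accept":
            break
        answer = author(query, answer, meta.suggestions)
    return answer


# Toy stand-ins for LLM calls (hypothetical; real agents would prompt a model).
def toy_author(query: str, previous_answer: Optional[str], feedback: str) -> str:
    # First draft is wrong on purpose; the revision fixes it.
    return "41" if previous_answer is None else "42"


def toy_reviewer(query: str, answer: str) -> Review:
    if answer == "42":
        return Review("accept", 4, "Arithmetic checks out.")
    return Review("reject", 5, "Off by one in the final step.")


def toy_meta_reviewer(query: str, answer: str, reviews: List[Review]) -> MetaDecision:
    # Confidence-weighted vote: an illustrative aggregation rule only.
    score = sum(
        r.confidence if r.decision == "accept" else -r.confidence
        for r in reviews
    )
    if score > 0:
        return MetaDecision("accept", "")
    notes = "; ".join(r.justification for r in reviews if r.decision == "reject")
    return MetaDecision("revise", notes)


final = run_mars(
    "What is 6 * 7?", toy_author, [toy_reviewer, toy_reviewer], toy_meta_reviewer
)
print(final)
```

In this sketch each reviewer is called once per round with only the query and the current answer, which mirrors the design choice of avoiding reviewer-to-reviewer interaction; only the meta-reviewer sees all reviews together.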