Multi-Agent Debate with Memory Masking

Hongduan Tian, Xiao Feng, Ziyuan Zhao, Xiangyu Zhu, Rolan Yan, Bo Han

Abstract

Large language models (LLMs) have recently demonstrated impressive capabilities in reasoning tasks. Mainstream LLM reasoning frameworks predominantly focus on scaling up inference-time sampling to enhance performance. Among these frameworks, *multi-agent debate* (MAD), which employs multiple LLMs as agents that reason through multi-round debate, has emerged as a powerful paradigm: it lets agents access memories from previous rounds to correct fallacious content and refine their reasoning iteratively in each debate round. However, although MAD significantly improves the reasoning capabilities of LLMs, we observe that erroneous memories remain and that LLM agents are vulnerable to them. To explore this phenomenon, we provide a theoretical insight that the performance of MAD depends heavily on the quality of the memories derived from the previous debate round, indicating that erroneous memories threaten the performance of MAD. To address this problem, we introduce a simple yet effective multi-agent debate framework, *multi-agent debate with memory masking* (MAD-M$^2$), which improves the robustness of MAD by allowing LLM agents to mask erroneous memories from the previous debate round at the beginning of each debate round. In this way, MAD-M$^2$ refines the contextual information before each debate round, preserving informative and meaningful memories while discarding erroneous ones. Extensive experiments and analyses on mainstream mathematical and logical reasoning benchmarks demonstrate that MAD-M$^2$ can identify erroneous memories and achieve better reasoning performance than MAD.

Paper Structure

This paper contains 30 sections, 2 theorems, 24 equations, 10 figures, 3 tables.

Key Result

Proposition 2.2

Consider $N_{\rm sc}$ independent responses generated by CoT-SC. Under the paper's assumption, the probability that the final answer is correct is bounded, with separate bounds for the cases $p<\frac{1}{2}$ and $p\geq\frac{1}{2}$. The corresponding lower bound (for $p<\frac{1}{2}$) and upper bound (for $p\geq\frac{1}{2}$) are $0$ and $1$, respectively.
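To build intuition for why the two cases $p<\frac{1}{2}$ and $p\geq\frac{1}{2}$ behave so differently, the following is a minimal sketch under a simplifying assumption that is ours, not necessarily the paper's: each of the $N_{\rm sc}$ responses is independently correct with probability $p$, and the final answer is correct exactly when strictly more than half of the responses are correct. The function name `majority_correct_prob` is hypothetical. It computes this probability exactly rather than reproducing the proposition's bound.

```python
from math import comb


def majority_correct_prob(n_sc: int, p: float) -> float:
    """Probability that strictly more than half of n_sc responses are correct.

    Assumption (illustrative only): responses are independent and each is
    correct with probability p, so the number of correct responses is
    Binomial(n_sc, p).
    """
    first_majority = n_sc // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n_sc, k) * p**k * (1 - p) ** (n_sc - k)
        for k in range(first_majority, n_sc + 1)
    )


if __name__ == "__main__":
    # Odd sample sizes avoid ties in the vote.
    for p in (0.4, 0.6):
        for n_sc in (1, 11, 101):
            print(f"p={p}, N_sc={n_sc}: {majority_correct_prob(n_sc, p):.4f}")
```

Under this toy model, adding samples pushes the probability toward 1 when $p>\frac{1}{2}$ and toward 0 when $p<\frac{1}{2}$, which matches the limiting values of 1 and 0 quoted in the proposition.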

Figures (10)

  • Figure 1: An illustration of the effects of erroneous memories. The example is a real case taken from the MATH dataset. In MAD, all memories from the previous debate round are considered in the next debate round. However, these memories may include erroneous reasoning responses (cf. the responses of debate round 1 in the figure). An erroneous memory may mislead an agent that was previously correct and result in a wrong final answer (cf. Agent 1 in debate round 2).
  • Figure 2: An illustration of the MAD-M$^2$ framework. MAD-M$^2$ consists of three main steps. (i) In the initial debate round, LLM agents independently generate responses to the given query. (ii) The responses generated in the previous round are treated as memories. All memories are critically evaluated, and potentially erroneous memories are masked before reasoning in the next debate round. (iii) With the preserved memories, agents perform reasoning in the next debate round.
  • Figure 3: Visualization of erroneous memory identification by different LLMs. Here we examine the erroneous memory identification capability of different LLMs. "S" denotes the strict rule, and "L" denotes the loose rule. According to the results, the objective masking strategy generally works better on powerful LLMs, while the subjective masking strategy works better on relatively weak LLMs.
  • Figure 4: Effect of scaling the number of agents for Qwen2.5-7B-Instruct. The number of agents is increased from 3 to 10. According to the figures, both frameworks benefit from more agents, and MAD-M$^2$ (S) tends to perform better in most cases.
  • Figure 5: Effect of scaling the number of agents for DeepSeek-Math-7B-Instruct. The number of agents is increased from 3 to 10. According to the figures, the two frameworks behave differently as the number of agents increases, and MAD-M$^2$ (O) performs better in most cases.
  • ...and 5 more figures
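
The three steps in the Figure 2 caption can be sketched as a short loop. This is a minimal illustration and not the authors' implementation. The names `mad_m2`, `make_toy_agent`, and `keep_memory` are hypothetical, the agents are toy stand-ins for LLM calls, and aggregating the final answer by majority vote is an assumption.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interfaces (not from the paper):
# an agent maps (query, visible memories) to an answer string;
# keep_memory returns True when a memory should be preserved (not masked).
Agent = Callable[[str, List[str]], str]
KeepRule = Callable[[str, str], bool]


def mad_m2(query: str, agents: List[Agent], keep_memory: KeepRule,
           rounds: int = 2) -> str:
    # (i) Initial round: every agent answers independently, with no memories.
    responses = [agent(query, []) for agent in agents]
    for _ in range(rounds - 1):
        # (ii) Treat the previous responses as memories and mask the ones
        # judged erroneous.
        memories = [m for m in responses if keep_memory(query, m)]
        # (iii) Every agent reasons again using only the preserved memories.
        responses = [agent(query, memories) for agent in agents]
    # Assumption: the final answer is the majority vote over the last round.
    return Counter(responses).most_common(1)[0][0]


def make_toy_agent(initial_answer: str) -> Agent:
    """Toy agent: keeps its own answer when it sees no memories, otherwise
    adopts the most common visible memory (so it can be misled)."""
    def agent(query: str, memories: List[str]) -> str:
        if not memories:
            return initial_answer
        return Counter(memories).most_common(1)[0][0]
    return agent


if __name__ == "__main__":
    agents = [make_toy_agent(a) for a in ("42", "41", "41")]
    # No masking: the correct agent is outvoted by erroneous memories.
    print(mad_m2("toy query", agents, lambda q, m: True))
    # Toy oracle masking: only the memory "42" is preserved.
    print(mad_m2("toy query", agents, lambda q, m: m == "42"))
```

In this toy setup, keeping every memory reproduces the failure mode of Figure 1, where a correct agent is pulled toward the erroneous majority, while masking the erroneous memories lets the correct answer propagate. In practice the keep rule would itself be an LLM judgment rather than an oracle.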

Theorems & Definitions (6)

  • Remark 1
  • Proposition 2.2: CoT-SC
  • Proposition 2.3: MAD
  • Remark 2
  • proof
  • proof