WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng

TL;DR

This paper studies robust multimodal reasoning through multi-agent debate (MAD). It introduces WISE, a modular framework that assigns agents to Solver and Reflector roles and uses an orchestrator to coordinate iterative reasoning. It extends Dawid–Skene aggregation to jointly estimate solver and reflector error models, enabling principled consensus across debate rounds. Across vision-language reasoning benchmarks (SMART-840, VisualPuzzles, EvoChart-QA, and SMART-840++), WISE achieves 2–7% accuracy gains over state-of-the-art MAD approaches, demonstrating the value of heterogeneous agent roles and two-stage feedback. The work contributes a scalable MAD architecture for multimodal tasks and a probabilistic aggregation method that improves robustness, with implications for zero-shot reasoning and ensemble design in multimodal AI systems.

Abstract

Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup generalizes the debate protocol to heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, which generate solutions, and Reflectors, which verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
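To make the aggregation idea concrete, here is a minimal sketch of the classic Dawid-Skene EM procedure that the abstract says WISE modifies. It is not the WISE variant: it ignores Reflector weights and debate rounds, and the function name `dawid_skene`, the `(item, agent, label)` vote format, and the `smoothing` parameter are assumptions made for illustration only.

```python
def _normalise(row):
    """Scale a list of non-negative numbers to sum to 1 (uniform if all zero)."""
    total = sum(row)
    if total <= 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def dawid_skene(votes, n_classes, n_iters=50, smoothing=0.01):
    """Classic Dawid-Skene EM (illustrative; not the WISE-specific variant).

    votes: list of (item, agent, label) triples using integer ids from 0.
    Returns (post, conf) where post[i][k] approximates P(true label of item i = k)
    and conf[a][k][l] approximates P(agent a reports l | true label = k).
    """
    n_items = 1 + max(i for i, _, _ in votes)
    n_agents = 1 + max(a for _, a, _ in votes)

    # Initialise posteriors with per-item vote fractions (a soft majority vote).
    post = [[0.0] * n_classes for _ in range(n_items)]
    for i, _, l in votes:
        post[i][l] += 1.0
    post = [_normalise(row) for row in post]

    conf = []
    for _ in range(n_iters):
        # M-step: class prior and per-agent confusion matrices from soft counts.
        prior = _normalise(
            [smoothing + sum(post[i][k] for i in range(n_items)) for k in range(n_classes)]
        )
        conf = [
            [[smoothing] * n_classes for _ in range(n_classes)]
            for _ in range(n_agents)
        ]
        for i, a, l in votes:
            for k in range(n_classes):
                conf[a][k][l] += post[i][k]
        conf = [[_normalise(row) for row in matrix] for matrix in conf]

        # E-step: posterior over each item's true label given every vote on it.
        new_post = [list(prior) for _ in range(n_items)]
        for i, a, l in votes:
            for k in range(n_classes):
                new_post[i][k] *= conf[a][k][l]
        post = [_normalise(row) for row in new_post]
    return post, conf


# Toy data (hypothetical): agents 0 and 1 agree on every item, agent 2 is erratic.
reliable = [0, 1, 0, 1]
erratic = [1, 1, 0, 0]
votes = []
for item in range(4):
    votes.append((item, 0, reliable[item]))
    votes.append((item, 1, reliable[item]))
    votes.append((item, 2, erratic[item]))

post, conf = dawid_skene(votes, n_classes=2)
labels = [max(range(2), key=row.__getitem__) for row in post]
```

In this toy run the estimated confusion matrices for agents 0 and 1 come out close to the identity, so their votes dominate the posterior. A WISE-style variant would additionally have to model the Reflector stage, which this sketch leaves out.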

Paper Structure

This paper contains 45 sections, 18 equations, 17 figures, and 16 tables.

Figures (17)

  • Figure 1: An illustration of WISE message passing on a problem from the SMART-840 dataset.
  • Figure 2: WISE architecture demonstrating the control flow, agent settings, feedback, responses, and the WISE-Dawid-Skene solution aggregation scheme to produce the final response.
  • Figure 3: Example problems from our proposed SMART-840++ dataset. Cells 1-5 show the same puzzle at increasing difficulty. Cells 6-8 show puzzle instances (grades 1-2) of the same difficulty level (1) but with different configurations. Cells 9-10 show diversity over instances for a puzzle from grades 7-8. The last row shows an ensemble of puzzles across grades at the highest difficulty level.
  • Figure 4: (a) shows performance on the EvoChart-QA dataset, (b) plots the cumulative and average accuracy for varied configurations against the number of debate rounds, and (c) plots the average number of rounds used by WISE.
  • Figure 5: A SMART-840 example and its summarized WISE debate. We show a four-round WISE debate using Claude-Sonnet-3.7, GPT-4.1, and Gemma-3. The solver predictions and reflector-assigned weights are shown (e.g., C/0). Although the correct answer D does not appear in Round 1, it emerges in later rounds. While the debate ranks C highest, the WISE-DS aggregation step—using each agent’s error-probability matrices—correctly recovers D via posterior inference. Additional examples and full debate transcripts are provided in the Appendix.
  • ...and 12 more figures
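
The Figure 5 caption notes that posterior inference over per-agent error-probability matrices can recover an answer that the raw vote ranks lower. The sketch below illustrates that general effect with plain Bayes' rule. Every number in it is hypothetical (it does not come from the paper), the agent names are placeholders, and the helper names `uniform_noise_matrix` and `posterior` are assumptions for illustration, not the WISE-DS implementation.

```python
from collections import Counter

CHOICES = "ABCDE"


def uniform_noise_matrix(accuracy):
    """Error model: rows are true answers, columns are reported answers.

    The agent is right with probability `accuracy`; errors are spread evenly.
    """
    off = (1.0 - accuracy) / (len(CHOICES) - 1)
    return {t: {r: (accuracy if r == t else off) for r in CHOICES} for t in CHOICES}


def posterior(votes, matrices, prior=None):
    """P(true answer | votes) via Bayes' rule, assuming independent agents."""
    scores = dict(prior or {c: 1.0 / len(CHOICES) for c in CHOICES})
    for agent, reported in votes.items():
        for t in CHOICES:
            scores[t] *= matrices[agent][t][reported]
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}


# Hypothetical error model: two solvers that often report C when the truth is D.
biased = uniform_noise_matrix(0.6)
biased["D"] = {"A": 0.05, "B": 0.05, "C": 0.5, "D": 0.35, "E": 0.05}

matrices = {
    "solver_1": biased,
    "solver_2": biased,
    "solver_3": uniform_noise_matrix(0.9),  # hypothetical, more reliable solver
}
votes = {"solver_1": "C", "solver_2": "C", "solver_3": "D"}

majority = Counter(votes.values()).most_common(1)[0][0]
post = posterior(votes, matrices)
best = max(post, key=post.get)
```

Here the plain majority is C, but the posterior favours D: the two C votes carry little evidence because those solvers are assumed to confuse D with C, while the single D vote comes from a solver assumed to be rarely wrong.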