DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning

Nithin Sivakumaran, Justin Chih-Yao Chen, David Wan, Yue Zhang, Jaehong Yoon, Elias Stengel-Eskin, Mohit Bansal

TL;DR

DART introduces a disagreement-driven, multi-agent framework that recruits specialized vision tools to resolve perceptual and reasoning disagreements in multimodal question answering. By integrating tool outputs and tool-aligned agreement scores into a multi-stage pipeline—initial answers, disagreement resolution, agreement scoring, discussion, and aggregation—DART achieves consistent improvements over strong baselines across A-OKVQA, MMMU, NaturalBench, and M3D. The approach demonstrates adaptability to new domains by incorporating domain-specific tools and yields richer, more diverse discussions than prior multi-agent methods. These findings highlight the practical value of coupling diverse vision tools with debate-style reasoning to enhance multimodal understanding and robustness.

Abstract

Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, or medical knowledge), but knowing which tools to call, and when to call them, can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, or spatial reasoning) that can resolve inter-agent disagreement. These tools enable fruitful multi-agent discussion by introducing new information and by providing tool-aligned agreement scores that highlight which agents agree with the expert tools. An aggregator agent then selects the best answer, given the agent outputs and the tool information. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single-agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU, respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single-agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.
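
The flow described in the abstract (initial answers, tool recruitment on disagreement, tool-aligned agreement scoring, discussion, aggregation) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the agent and tool stubs, the names `detect_disagreement`, `agreement_score`, `recruit_ocr`, and `run_pipeline`, the token-overlap score, and the score-weighted vote that stands in for the aggregator agent.

```python
from collections import Counter
from typing import Callable, Dict

Agent = Callable[[str, str], str]  # (question, context) -> answer
Tool = Callable[[str], str]        # (question) -> tool output


def detect_disagreement(answers: Dict[str, str]) -> bool:
    """Agents disagree when their normalized answers are not all identical."""
    return len({a.strip().lower() for a in answers.values()}) > 1


def agreement_score(answer: str, tool_output: str) -> float:
    """Toy agreement score (assumption): share of answer tokens found in the tool output."""
    answer_tokens = answer.lower().split()
    tool_tokens = set(tool_output.lower().split())
    if not answer_tokens:
        return 0.0
    return sum(t in tool_tokens for t in answer_tokens) / len(answer_tokens)


def run_pipeline(question: str, agents: Dict[str, Agent], tools: Dict[str, Tool],
                 recruit: Callable[[str, Dict[str, str]], str]) -> str:
    # (1) Initial answer generation.
    answers = {name: agent(question, "") for name, agent in agents.items()}
    if not detect_disagreement(answers):
        return next(iter(answers.values())).strip().lower()
    # (2) Tool-based disagreement resolution: a recruiter picks one tool.
    tool_name = recruit(question, answers)
    tool_output = tools[tool_name](question)
    # (3) Agreement scoring of each initial answer against the tool output.
    scores = {n: agreement_score(a, tool_output) for n, a in answers.items()}
    # (4) Discussion: agents answer again with the tool output as context.
    context = f"{tool_name} output: {tool_output}"
    revised = {name: agent(question, context) for name, agent in agents.items()}
    # (5) Aggregation: score-weighted vote, a stand-in for an aggregator agent.
    votes: Counter = Counter()
    for name, ans in revised.items():
        votes[ans.strip().lower()] += 1.0 + scores[name]
    return votes.most_common(1)[0][0]


# Hypothetical stubs loosely modeled on the parking-meter example in Figure 5.
toy_agents: Dict[str, Agent] = {
    "agent_a": lambda q, ctx: "Weekends",
    "agent_b": lambda q, ctx: "Weekends" if "M-F" in ctx else "Sunday",
    "agent_c": lambda q, ctx: "Never",
}
toy_tools: Dict[str, Tool] = {"ocr": lambda q: "M-F 9am-6pm"}


def recruit_ocr(question: str, answers: Dict[str, str]) -> str:
    """Hypothetical recruiter that always selects the OCR tool."""
    return "ocr"


if __name__ == "__main__":
    print(run_pipeline("When does meter enforcement have their days off?",
                       toy_agents, toy_tools, recruit_ocr))
```

In this sketch the tool is only called when the initial answers differ, which mirrors the idea that disagreement, rather than the question alone, triggers tool recruitment. A real system would replace the stubs with VLM calls and the vote with a model-based aggregator.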

Paper Structure

This paper contains 39 sections, 1 equation, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Previous work has explored using (A) multiple agents in debate to refine their reasoning, but this approach is limited to the abilities of the agents. Alternatively, some methods employ a (B) top‑down LLM agent that invokes vision tools, yet they plan tool usage based solely on the question and overlook the visual information itself. In our method (C), we facilitate a discussion among multiple agents with targeted intervention from a pool of vision tools. These tools address disagreements detected in a debate of VLM agents, with their specialized vision outputs and agreement scores being used for future discussion.
  • Figure 2: Overview of DART. We start with (1) Initial Answer Generation from a set of answering/reasoning agents. This is followed by (2) Tool-Based Disagreement Resolution and (3) Agreement Scoring. The newly generated tool outputs and agreement scores are incorporated into the (4) Discussion and (5) Aggregation phases.
  • Figure 3: Breakdown of total tool calling of DART on A-OKVQA.
  • Figure 4: Performance of DART and multi-agent debate over three rounds. The error bars indicate the standard deviation among the individual answering agents.
  • Figure 5: Qualitative example for DART. The input question is "When does meter enforcement have their days off?" with gold answer Weekends. Ovis is the only model to answer correctly, with MiniCPM-o jumping to an improper conclusion and QwenVL being objectively incorrect. The recruiter identifies this disagreement about what text appears on the meter and calls on the OCR tool to resolve it. The OCR tool correctly identifies the text as M-F 9am-6pm. As a result, the models and aggregator reach the correct answer in subsequent steps.
  • ...and 1 more figure