Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta

TL;DR

This paper introduces MAMMQA, a three-stage, prompt-driven framework that uses modality-specific experts, cross-modal synthesizers, and a consensus aggregator to answer questions over text, tables, and images without fine-tuning. The approach decouples evidence extraction from cross-modal integration and final adjudication, improving interpretability and reducing hallucinations by making each reasoning step explicit and traceable. Empirically, MAMMQA achieves state-of-the-art zero-shot performance on MultiModalQA and ManyModalQA across both proprietary and open-source LLMs, with strong robustness to noisy contexts and mislabels. The work demonstrates that structured, multi-agent prompting can match or exceed fine-tuned baselines, offering scalable, reusable reasoning architectures for real-world MMQA tasks.

Abstract

Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy and overlook the unique characteristics of each modality, which ultimately limits both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
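
The three-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable interface, the `Trace` record, the prompt strings, and the `echo_model` stand-in are all hypothetical, and the sketch assumes a single synthesis call over all expert notes, which the summary does not confirm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical model interface: takes a prompt string, returns a text reply.
# A real system would wrap a VLM or LLM API call here.
Model = Callable[[str], str]


@dataclass
class Trace:
    """Keeps every intermediate output so the reasoning can be inspected."""
    expert_notes: Dict[str, str] = field(default_factory=dict)
    synthesis: List[str] = field(default_factory=list)
    answer: str = ""


def run_pipeline(
    question: str,
    context: Dict[str, str],  # modality name -> serialized evidence (assumed format)
    expert: Model,
    synthesizer: Model,
    aggregator: Model,
) -> Trace:
    trace = Trace()

    # Stage 1: one expert call per modality, each seeing only its own evidence.
    for modality, evidence in context.items():
        prompt = f"[expert:{modality}] Question: {question}\nEvidence: {evidence}"
        trace.expert_notes[modality] = expert(prompt)

    # Stage 2: cross-modal synthesis over the expert notes
    # (a single call here; this is an assumption of the sketch).
    notes = "\n".join(f"{m}: {n}" for m, n in trace.expert_notes.items())
    trace.synthesis.append(
        synthesizer(f"[synth] Question: {question}\nNotes:\n{notes}")
    )

    # Stage 3: the aggregator sees only the synthesized candidates.
    candidates = "\n".join(trace.synthesis)
    trace.answer = aggregator(
        f"[aggregate] Question: {question}\nCandidates:\n{candidates}"
    )
    return trace


def echo_model(prompt: str) -> str:
    # Deterministic stand-in for a real model: returns the prompt's last line.
    return prompt.strip().splitlines()[-1]


if __name__ == "__main__":
    demo = run_pipeline(
        "What is the capital of France?",
        {"text": "Paris is the capital of France.",
         "table": "country|capital\nFrance|Paris"},
        echo_model, echo_model, echo_model,
    )
    print(demo.expert_notes)
    print(demo.answer)
```

Because each stage writes to `Trace`, the intermediate evidence and the synthesized candidates remain available for inspection after the final answer is produced, which mirrors the transparency argument made in the abstract.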

Paper Structure

This paper contains 21 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of our proposed MAMMQA, with three agents: 1) Modality Expert, which extracts modality-specific insights; 2) Cross-Modal Synthesis Agent, which synchronizes information across modalities using insights from the Modality Expert; 3) Aggregator Agent, which grounds the answer using the extracted cross-modal information.
  • Figure 2: Aggregator Agent performance with and without the question on the MultiModalQA dataset.
  • Figure 3: Modality Expert Agent Prompt
  • Figure 4: Cross Modality Agent Prompt
  • Figure 5: Aggregator Prompt