A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning

Changmeng Zheng, Dayong Liang, Wengyu Zhang, Xiao-Yong Wei, Tat-Seng Chua, Qing Li

TL;DR

This work addresses multimodal reasoning with multiple perspective agents by introducing Blueprint Debate on Graphs (BDoG), a top-down, deductive framework that confines debates to a blueprint graph and stores evidence in branches to prevent opinion trivialization and distraction from image-derived concepts. BDoG initializes a compact blueprint graph from multimodal inputs, then uses Proponent, Opponent, and Moderator agents to iteratively refine and condense the graph via graph condensation, with termination based on a distance criterion $\|\mathcal{G}^{i+1}-\mathcal{G}^i\| \leq \epsilon$. Empirical results on ScienceQA-IMG and MMBench show consistent, significant gains across backbones, achieving state-of-the-art performance (e.g., GeminiProVision + BDoG reaching around 81% on both benchmarks) and narrowing the gap between small and large models. The work also provides insights into the interplay between debate dynamics and graph-based reasoning, highlighting improvements in explainability and efficiency through structured, evidence-grounded discussions.
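The debate loop and its stopping rule $\|\mathcal{G}^{i+1}-\mathcal{G}^i\| \leq \epsilon$ can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the graph is a set of (entity, relation, entity) triples, that the distance is the number of triples on which two graphs differ, and that each agent is a plain function standing in for a multimodal LLM call. The names `graph_distance`, `blueprint_debate`, and the toy agents are hypothetical.

```python
from typing import Callable, FrozenSet, Tuple

Triple = Tuple[str, str, str]                    # (entity, relation, entity)
Graph = FrozenSet[Triple]
Debater = Callable[[Graph], Graph]               # revises a graph
Moderator = Callable[[Graph, Graph], Graph]      # condenses two revisions


def graph_distance(a: Graph, b: Graph) -> int:
    """Assumed distance: count of triples present in exactly one graph."""
    return len(a ^ b)


def blueprint_debate(blueprint: Graph, proponent: Debater, opponent: Debater,
                     moderator: Moderator, epsilon: int = 0,
                     max_rounds: int = 5) -> Tuple[Graph, int]:
    """Refine the blueprint until one round changes it by at most epsilon."""
    graph = blueprint
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        affirmative = proponent(graph)        # proponent revises the graph
        negative = opponent(affirmative)      # opponent challenges that revision
        condensed = moderator(affirmative, negative)
        converged = graph_distance(condensed, graph) <= epsilon
        graph = condensed
        if converged:
            break
    return graph, rounds


# Toy agents (hypothetical stand-ins for model calls).
def toy_proponent(g: Graph) -> Graph:
    # Adds one piece of supporting evidence as a new branch.
    return g | {("north", "attracts", "south")}


def toy_opponent(g: Graph) -> Graph:
    # Prunes a distractor concept that came from the image.
    return frozenset(t for t in g if t[0] != "table")


def toy_moderator(affirmative: Graph, negative: Graph) -> Graph:
    # Keeps only the triples both sides accept.
    return affirmative & negative


blueprint: Graph = frozenset({
    ("magnet_A", "faces", "magnet_B"),
    ("magnet_A", "pole", "north"),
    ("magnet_B", "pole", "south"),
    ("table", "color", "brown"),
})

final_graph, rounds = blueprint_debate(
    blueprint, toy_proponent, toy_opponent, toy_moderator
)
```

With these toy agents the first round adds the evidence triple and prunes the distractor, and the second round leaves the graph unchanged, so the loop stops after two rounds. In a real system the distance and the agents would be defined by the model prompts and the graph representation actually used.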

Abstract

This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address them, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). In BDoG, debates are confined to a blueprint graph to prevent opinion trivialization through word-level summarization. Moreover, by storing evidence in branches within the graph, BDoG mitigates distractions caused by frequent but irrelevant concepts. Extensive experiments validate that BDoG achieves state-of-the-art results on ScienceQA and MMBench with significant improvements over previous methods. The source code can be accessed at https://github.com/thecharm/BDoG.

Paper Structure (32 sections, 10 equations, 12 figures, 5 tables, 1 algorithm)

Figures (12)

  • Figure 1: Comparison on the ScienceQA dataset of a direct answer from an MLLM, Multimodal Chain-of-Thought (CoT), Multi-agent Debate (MAD) and our Blueprint Debate on Graph (BDoG). BDoG confines debates to a blueprint and stores evidence in graph branches, which mitigates word-level opinion trivialization and distractions caused by irrelevant concepts.
  • Figure 2: Comparison of CoT, Duty-Distinct CoT (DDCoT), Self-Correction, Multi-agent Debate (MAD) and our proposed Blueprint Debate on Graph (BDoG). Q: input question, I: input image, C: context or hint, A: answer, R: rationale, G: blueprint.
  • Figure 3: Case study of our proposed Blueprint Debate on Graph (BDoG) and vanilla Multi-agent Debate (BDoG$^{Debate}$) on the ScienceQA-IMG (left) and MMBench (right) datasets. Green indicates a correct answer/rationale and red indicates incorrect/irrelevant predictions.
  • Figure 4: Statistics of intra-round (left) and inter-round (right) Blueprint condensation of BDoG with GeminiProVision for ScienceQA-IMG test set. #Update: number of updated attributes; #Prune: number of pruned entities/relations; #Add: number of newly-added entities/relations.
  • Figure 5: Effectiveness vs. efficiency results, comparing our proposed Blueprint Debate-on-Graph (BDoG) and vanilla Multi-agent Debate (BDoG (Debate)) on GeminiProVision. The bar chart indicates the inference time on three datasets and lines indicate the zero-shot performance (Accuracy).
  • ...and 7 more figures