Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering

Zeqing Wang, Wentao Wan, Qiqing Lao, Runmeng Chen, Minjie Lang, Xiao Wang, Keze Wang, Liang Lin

TL;DR

This work addresses the limitations of end-to-end VQA and LLM-assisted approaches by introducing SIRI, a three-agent framework that emulates human top-down reasoning to leverage image-based common-sense knowledge. The Responder, Seeker, and Integrator collaborate to generate answer candidates, identify relevant issues, formulate hypotheses with confidence words, build a Multi-View Knowledge Base, and produce a final answer via score-based voting. Key contributions include the novel multi-agent architecture, an explicit mechanism for external knowledge integration through the MVKB, and demonstrated zero-shot improvements across four diverse VQA datasets without extra training. The approach yields interpretable reasoning traces and shows strong potential for extending to video-based VQA, highlighting a practical path to enhance Vision-Language Models without additional data collection or training.

Abstract

Recently, to comprehensively improve Vision Language Models (VLMs) for Visual Question Answering (VQA), several methods have been proposed to strengthen the inference capabilities of VLMs so that they tackle VQA tasks independently, in contrast to methods that use VLMs only as aids to Large Language Models (LLMs). However, these methods ignore the rich common-sense knowledge inside the given VQA image, which is sampled from the real world. Thus, they cannot fully exploit the VLM on the given VQA question to achieve optimal performance. To overcome this limitation, and inspired by the human top-down reasoning process, i.e., systematically exploring relevant issues to derive a comprehensive answer, this work introduces a novel, explainable multi-agent collaboration framework that leverages the expansive knowledge of LLMs to enhance the capabilities of VLMs themselves. Specifically, our framework comprises three agents, i.e., Responder, Seeker, and Integrator, which collaboratively answer the given VQA question by seeking its relevant issues and generating the final answer in a top-down reasoning process. The VLM-based Responder agent generates the answer candidates for the question and responds to the relevant issues. The Seeker agent, primarily based on an LLM, identifies issues relevant to the question to inform the Responder agent and constructs a Multi-View Knowledge Base (MVKB) for the given visual scene by leveraging the built-in world knowledge of the LLM. The Integrator agent combines knowledge from the Seeker agent and the Responder agent to produce the final VQA answer. Extensive evaluations on diverse VQA datasets with a variety of VLMs demonstrate the superior performance and interpretability of our framework over the baseline method in the zero-shot setting, without extra training cost.
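To make the Integrator's role more concrete, the following is a minimal sketch of score-based voting over a pool of votes tagged with confidence words. It is an illustration under stated assumptions, not the authors' implementation: the confidence vocabulary, the numeric weights in `CONFIDENCE_SCORES`, the `vote` function, and the example pool are all hypothetical, since this summary does not specify them.

```python
from collections import defaultdict

# Hypothetical mapping from confidence words to numeric scores (an assumption;
# the paper's actual vocabulary and weights are not given in this summary).
CONFIDENCE_SCORES = {
    "certain": 1.0,
    "likely": 0.75,
    "possible": 0.5,
    "unlikely": 0.25,
}


def vote(voting_pool):
    """Accumulate a score per candidate answer and return the top candidate.

    voting_pool: list of (candidate_answer, confidence_word) pairs.
    Returns (best_answer, totals), where totals maps each answer to its score.
    """
    totals = defaultdict(float)
    for answer, word in voting_pool:
        totals[answer] += CONFIDENCE_SCORES[word]
    # Pick the candidate with the highest accumulated score;
    # ties resolve to the candidate that entered the pool first.
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Illustrative pool: each entry is one vote for an answer candidate.
pool = [
    ("yes", "likely"),
    ("no", "possible"),
    ("yes", "possible"),
    ("no", "unlikely"),
]
answer, totals = vote(pool)
print(answer, totals)
```

Weighting votes by confidence, rather than counting them equally, lets a strongly supported hypothesis outweigh several weakly supported ones; how the actual framework derives and weights its votes may differ.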

Paper Structure (13 sections, 6 equations, 8 figures, 8 tables, 2 algorithms)

Figures (8)

  • Figure 1: Illustration of how humans solve VQA tasks in a top-down reasoning process. When solving a question, humans look for relevant issues that can distinguish between the answer candidates. Different answers to a relevant issue lead humans to select different answer candidates, based on the correlation between the two. For example, cloudy skies suggest that it will rain, while clear skies suggest that it will not. Humans derive this correlation from a wealth of world knowledge.
  • Figure 2: Demonstration of how the top-down reasoning process helps a VLM answer questions more accurately. Although a VLM can handle a wide variety of question types, it lacks knowledge of the correlations between different issues. SIRI enhances the VLM by injecting this connective information via hypotheses.
  • Figure 3: The interaction among agents in our SIRI. The Responder agent receives a question-image pair and generates answer candidates for the given question. Additionally, the Responder agent can also serve as a captioner, providing descriptive content for the image. The Seeker agent leverages the answer candidates, the caption of the image and the capabilities of LLMs to generate relevant issues. With assistance from the Responder agent, the Seeker agent obtains responses for each relevant issue, i.e., the answer candidates for each relevant issue. Subsequently, the Seeker agent aggregates information from the responses, answer candidates, and the image caption, generating hypotheses with their confidence words, which construct a Multi-View Knowledge Base (MVKB). The Integrator agent utilizes the MVKB from the Seeker and the Responder to get the voting pool and further votes to obtain the final answer. In this process, $H$ represents the Hypothesis. Given that there are 2 candidate answers, each relevant issue combined with the original question will generate four hypotheses, namely $H_1$, $H_2$, $H_3$, and $H_4$. To better illustrate how these Hypotheses are utilized in the interaction between the Integrator and the Responder agent, i.e., the step of Re-answer Question, we provide a single-step example in Figure 5.
  • Figure 4: Multi-View Knowledge Base in Seeker. The illustration highlights the distinctive characteristics of Multi-View Knowledge in response to different questions and the specific visual scene. The knowledge encompasses diverse hypotheses that establish connections with the question and the relevant issues in the given scene.
  • Figure 5: A single-step example where the Responder agent is based on LLaVA-13B and the Seeker agent is based on gpt4o-mini.
  • ...and 3 more figures
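The Figure 3 caption notes that, with 2 candidate answers, each relevant issue combined with the original question yields four hypotheses. One plausible reading is that each hypothesis pairs one response to the relevant issue with one answer candidate. The sketch below enumerates such pairs; the function name, the dictionary layout, and the example question (loosely based on the rain example in Figure 1) are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def enumerate_hypotheses(question, candidates, issue, responses):
    """Pair every response to a relevant issue with every answer candidate.

    Assumed structure (hypothetical): each hypothesis reads
    "if <issue> is answered <response>, then <question> is answered <candidate>".
    """
    return [
        {
            "id": f"H{i}",
            "if": f"{issue} -> {response}",
            "then": f"{question} -> {candidate}",
        }
        for i, (response, candidate) in enumerate(
            product(responses, candidates), start=1
        )
    ]


# Illustrative inputs: 2 answer candidates and 2 responses to one relevant issue.
hypotheses = enumerate_hypotheses(
    question="Will it rain?",
    candidates=["yes", "no"],
    issue="Is the sky cloudy?",
    responses=["yes", "no"],
)
for h in hypotheses:
    print(h["id"], "|", h["if"], "|", h["then"])
```

In the framework as described, the Seeker would then attach a confidence word to each hypothesis, so that a pairing such as "cloudy, so rain" can receive more support than "cloudy, so no rain".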