Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering
Zeqing Wang, Wentao Wan, Qiqing Lao, Runmeng Chen, Minjie Lang, Xiao Wang, Keze Wang, Liang Lin
TL;DR
This work addresses the limitations of end-to-end VQA and LLM-assisted approaches by introducing SIRI, a three-agent framework that emulates human top-down reasoning to leverage image-based common-sense knowledge. The Responder, Seeker, and Integrator collaborate to generate answer candidates, identify relevant issues, formulate hypotheses with confidence words, build a Multi-View Knowledge Base, and produce a final answer via score-based voting. Key contributions include the novel multi-agent architecture, an explicit mechanism for external knowledge integration through the MVKB, and demonstrated zero-shot improvements across four diverse VQA datasets without extra training. The approach yields interpretable reasoning traces and shows strong potential for extending to video-based VQA, highlighting a practical path to enhance Vision-Language Models without additional data collection or training.
Abstract
Recently, to comprehensively improve Vision Language Models (VLMs) for Visual Question Answering (VQA), several methods have been proposed to reinforce the inference capabilities of VLMs so that they tackle VQA tasks independently, in contrast to methods that only use VLMs as aids to Large Language Models (LLMs). However, these methods ignore the rich common-sense knowledge inside the given VQA image, which is sampled from the real world. Thus, they cannot fully exploit the powerful VLM on the given VQA question to achieve optimal performance. To overcome this limitation, and inspired by the human top-down reasoning process, i.e., systematically exploring relevant issues to derive a comprehensive answer, this work introduces a novel, explainable multi-agent collaboration framework that leverages the expansive knowledge of LLMs to enhance the capabilities of VLMs themselves. Specifically, our framework comprises three agents, i.e., Responder, Seeker, and Integrator, which collaboratively answer the given VQA question by seeking its relevant issues and generating the final answer through this top-down reasoning process. The VLM-based Responder agent generates the answer candidates for the question and responds to other relevant issues. The Seeker agent, primarily based on an LLM, identifies relevant issues related to the question to inform the Responder agent and constructs a Multi-View Knowledge Base (MVKB) for the given visual scene by leveraging the built-in world knowledge of the LLM. The Integrator agent combines knowledge from the Seeker agent and the Responder agent to produce the final VQA answer. Extensive evaluations on diverse VQA datasets with a variety of VLMs demonstrate the superior performance and interpretability of our framework over the baseline method in the zero-shot setting without extra training cost.
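The three-agent flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`answer_question`, `integrate`), the callable interfaces standing in for the VLM and LLM, and the numeric weights attached to confidence words are all assumptions made for illustration. The list of hypotheses here is only a loose stand-in for the MVKB, whose actual structure is not specified in this abstract.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mapping from confidence words to vote weights (an assumption,
# not values taken from the paper).
CONFIDENCE_WEIGHTS: Dict[str, float] = {"certain": 1.0, "likely": 0.6, "possible": 0.3}

# A hypothesis pairs the candidate answer it supports with a confidence word.
Hypothesis = Tuple[str, str]


def integrate(candidates: List[str], hypotheses: List[Hypothesis]) -> str:
    """Integrator role: score-based voting over the answer candidates."""
    scores = {c: 0.0 for c in candidates}
    for answer, word in hypotheses:
        if answer in scores:
            # Unknown confidence words contribute nothing to the score.
            scores[answer] += CONFIDENCE_WEIGHTS.get(word, 0.0)
    # max() keeps the first maximal element, so ties go to the earlier candidate.
    return max(candidates, key=lambda c: scores[c])


def answer_question(
    question: str,
    propose: Callable[[str], List[str]],      # Responder: question -> candidates
    seek_issues: Callable[[str], List[str]],  # Seeker: question -> relevant issues
    respond: Callable[[str], str],            # Responder: issue -> response
    hypothesize: Callable[[str, str, List[str]], Hypothesis],  # Seeker
) -> str:
    """Top-down loop: propose candidates, explore issues, then vote."""
    candidates = propose(question)
    hypotheses: List[Hypothesis] = []
    for issue in seek_issues(question):
        response = respond(issue)
        hypotheses.append(hypothesize(issue, response, candidates))
    return integrate(candidates, hypotheses)


# Usage with stub agents in place of real VLM/LLM calls.
answer = answer_question(
    "What season is it?",
    propose=lambda q: ["winter", "summer"],
    seek_issues=lambda q: ["Is there snow?", "What are people wearing?"],
    respond=lambda issue: "yes" if "snow" in issue else "coats",
    hypothesize=lambda issue, resp, cands: ("winter", "likely"),
)
```

Passing the agents as plain callables keeps the sketch independent of any particular model API; in practice each callable would wrap a prompt to the underlying VLM or LLM.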
