Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao, Jian Jia, Longteng Guo, Qunbo Wang, Te Yang, Yan Li, Yanhua Cheng, Bo Wang, Quan Chen, Han Li, Jing Liu
TL;DR
This work addresses knowledge-based VQA (KB-VQA), where retrieved knowledge passages are often long and noisy. It introduces two synergistic components: a Knowledge Condensation model that uses a visual-language model to extract concise knowledge concepts and a large language model to summarize the knowledge essence, and a Knowledge Reasoning model that encodes visual context, questions, condensed knowledge, and implicit cues to generate answers. The approach achieves state-of-the-art results on OK-VQA (65.1%) and A-OKVQA (60.1%) without using GPT-3, and ablations validate the benefits of condensation and of fusing it into reasoning. By filtering out irrelevant information and combining multimodal and textual condensation, the method integrates external knowledge more effectively for KB-VQA.
Abstract
Knowledge-based visual question answering (KB-VQA) is a challenging task that requires a model to leverage external knowledge to comprehend and answer questions grounded in visual content. Recent studies retrieve knowledge passages from external knowledge bases and then use them to answer questions. However, these retrieved passages often contain irrelevant or noisy information, which limits model performance. To address this challenge, we propose two synergistic models: a Knowledge Condensation model and a Knowledge Reasoning model. We condense the retrieved knowledge passages from two perspectives. First, we leverage the multimodal perception and reasoning ability of visual-language models to distill concise knowledge concepts from lengthy retrieved passages, ensuring relevance to both the visual content and the question. Second, we leverage the text comprehension ability of large language models to summarize and condense the passages into a knowledge essence that helps answer the question. These two types of condensed knowledge are then integrated into our Knowledge Reasoning model, which reasons over the combined information to produce the final answer. Extensive experiments validate the effectiveness of the proposed method. Compared to previous methods, our method achieves state-of-the-art performance on knowledge-based VQA datasets (65.1% on OK-VQA and 60.1% on A-OKVQA) without resorting to knowledge produced by GPT-3 (175B).
