Knowledge Condensation and Reasoning for Knowledge-based VQA

Dongze Hao, Jian Jia, Longteng Guo, Qunbo Wang, Te Yang, Yan Li, Yanhua Cheng, Bo Wang, Quan Chen, Han Li, Jing Liu

TL;DR

This work tackles KB-VQA by addressing the problem of noisy, lengthy retrieved knowledge. It introduces two synergistic components: a Knowledge Condensation model, which uses a visual-language model to extract concise knowledge concepts and a large language model to summarize the knowledge essence, and a Knowledge Reasoning model, which encodes the visual context, question, condensed knowledge, and implicit knowledge to generate the answer. The approach achieves state-of-the-art results on OK-VQA (65.1%) and A-OKVQA (60.1%) without using knowledge produced by GPT-3 (175B), and ablations support the benefits of condensation and of fusing the different knowledge types during reasoning. By filtering irrelevant information and combining multimodal and textual reasoning, the method integrates external knowledge into KB-VQA more reliably than using raw retrieved passages.

Abstract

Knowledge-based visual question answering (KB-VQA) is a challenging task that requires the model to leverage external knowledge to comprehend and answer questions grounded in visual content. Recent studies retrieve knowledge passages from external knowledge bases and then use them to answer questions. However, these retrieved knowledge passages often contain irrelevant or noisy information, which limits the performance of the model. To address this challenge, we propose two synergistic models: a Knowledge Condensation model and a Knowledge Reasoning model. We condense the retrieved knowledge passages from two perspectives. First, we leverage the multimodal perception and reasoning ability of visual-language models to distill concise knowledge concepts from the lengthy retrieved passages, ensuring relevance to both the visual content and the question. Second, we leverage the text comprehension ability of large language models to summarize and condense the passages into the knowledge essence that helps answer the question. These two types of condensed knowledge are then integrated into our Knowledge Reasoning model, which reasons over the combined information to arrive at the final answer. Extensive experiments validate the superiority of the proposed method. Compared to previous methods, our method achieves state-of-the-art performance on knowledge-based VQA datasets (65.1% on OK-VQA and 60.1% on A-OKVQA) without resorting to the knowledge produced by GPT-3 (175B).

Paper Structure (16 sections, 5 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: Comparison with previous methods. (a) Previous methods [gui2021kat, lin2022revive, lin2022retrieval] convert the images to visual contexts (captions) and send them to the LLM along with the questions and retrieved knowledge passages to predict the answers. Because the retrieved knowledge passages contain much noisy information, they mislead the model into predicting the wrong answer "surfing". (b) We leverage a trainable VLM and a frozen LLM to condense lengthy knowledge passages into concise knowledge concepts and knowledge essence, mitigating the interference of noisy information. With the condensed knowledge, our knowledge reasoning model generates the right answer.
  • Figure 2: The structure of the knowledge condensation model. The knowledge condensation model consists of a visual-language model (VLM) and a large language model (LLM). The VLM takes the image, question, and each retrieved passage as inputs and is trained under the supervision of the ground-truth answer. By utilizing the multimodal perception and reasoning ability of the VLM, each knowledge passage is condensed into a knowledge concept. The LLM takes the visual context, question, and each retrieved passage as inputs; we directly prompt the LLM to condense each knowledge passage into the knowledge essence.
  • Figure 3: The structure of the knowledge reasoning model. (a) We concatenate the visual context, question, condensed knowledge concepts and essence, and implicit knowledge into a single sentence and encode this information; the decoder then generates the final answer. (b) We concatenate the visual context and question with each type of knowledge to form separate sentences. We encode these sentences into separate embeddings, which are concatenated and fed to the decoder to generate the answer.
  • Figure 4: Case study of our method. The condensed knowledge is the valid information distilled from the retrieved knowledge passages by the knowledge condensation model. The implicit knowledge and corresponding scores (in brackets) are produced by a pre-trained VQA classification model, MCAN. The condensed knowledge provides key information that helps the model reason toward the answer. With the extra implicit knowledge, the knowledge reasoning model can reason over additional knowledge to select the right knowledge for answering the question.
  • Figure 5: Qualitative comparison between using knowledge passages and using knowledge concepts and essence. The blue box corresponds to the original retrieved knowledge passages. The yellow box corresponds to the condensed knowledge concepts and essence.
  • ...and 2 more figures
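
The two input layouts described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` class, the function names, and the `context:` / `question:` / `concepts:` / `essence:` / `implicit:` field labels are all hypothetical, chosen only to show how variant (a) builds one concatenated sentence while variant (b) builds one sentence per knowledge type, each of which would be encoded separately before the embeddings are concatenated for the decoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical container for one KB-VQA example (names are assumptions)."""
    visual_context: str   # caption-style description of the image
    question: str
    concepts: List[str]   # condensed knowledge concepts (VLM output)
    essences: List[str]   # condensed knowledge essence (LLM output)
    implicit: List[str]   # implicit knowledge, e.g. candidate answers


def build_single_input(s: Sample) -> str:
    """Variant (a): all fields concatenated into a single sentence."""
    parts = [
        f"context: {s.visual_context}",
        f"question: {s.question}",
        f"concepts: {', '.join(s.concepts)}",
        f"essence: {' '.join(s.essences)}",
        f"implicit: {', '.join(s.implicit)}",
    ]
    return " ".join(parts)


def build_separate_inputs(s: Sample) -> List[str]:
    """Variant (b): one sentence per knowledge type, each prefixed with
    the visual context and the question."""
    prefix = f"context: {s.visual_context} question: {s.question}"
    return [
        f"{prefix} concepts: {', '.join(s.concepts)}",
        f"{prefix} essence: {' '.join(s.essences)}",
        f"{prefix} implicit: {', '.join(s.implicit)}",
    ]
```

In variant (b), each returned string would be passed through the encoder on its own, so the decoder attends over three separate embeddings rather than one long sequence; the encoder and decoder themselves are outside the scope of this sketch.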