Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

Elham J. Barezi, Parisa Kordjamshidi

TL;DR

This study shows that replacing a complex question with several simpler questions helps extract more relevant information from the image and provides a stronger comprehension of it, and it demonstrates the positive impact of asking simple questions before retrieving visual or non-visual information.

Abstract

We study the knowledge-based visual question-answering (KB-VQA) problem, in which, given a question, a model must ground it in the visual modality to find the answer. Although many recent works use question-dependent captioners to verbalize the given image and use Large Language Models to solve the VQA problem, results show that they do not perform well on multi-hop questions. Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image and provides a stronger comprehension of it. Moreover, we analyze the decomposed questions to determine the modality of the information required to answer them, using a captioner for the visual questions and LLMs as a general knowledge source for the non-visual, KB-based questions. Our results demonstrate the positive impact of using simple questions before retrieving visual or non-visual information. We provide results and analysis on three well-known VQA datasets, OKVQA, A-OKVQA, and KRVQA, and achieve up to 2% improvement in accuracy.
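To make the pipeline described in the abstract concrete, here is a minimal sketch of the decompose-then-route idea. Everything in it is a hypothetical placeholder: `llm`, `captioner`, `decompose`, `is_visual`, and `answer_kbvqa` are illustrative names, prompts, and signatures, not the authors' code or any real library API.

```python
# Minimal sketch of the decompose-then-route pipeline, assuming a generic
# LLM client and a question-dependent captioner. All names, prompts, and
# signatures here (llm, captioner, decompose, is_visual, answer_kbvqa) are
# hypothetical placeholders, not the authors' code or a real library API.

def llm(prompt: str) -> str:
    """Stand-in for a few-shot large language model call."""
    raise NotImplementedError("plug in an LLM client here")

def captioner(image, question: str) -> str:
    """Stand-in for a question-dependent image captioner."""
    raise NotImplementedError("plug in a captioning model here")

def decompose(question: str) -> list[str]:
    # Replace one complex, multi-hop question with several simpler ones
    # (the exact prompt wording is illustrative only).
    out = llm("Decompose this question into simple sub-questions, "
              f"one per line:\n{question}")
    return [q.strip() for q in out.splitlines() if q.strip()]

def is_visual(sub_question: str) -> bool:
    # Type check: does this sub-question need the image, or only
    # general (non-visual, KB-based) knowledge?
    reply = llm(f"Does answering this require the image? yes/no:\n{sub_question}")
    return reply.strip().lower().startswith("yes")

def answer_kbvqa(image, question: str) -> str:
    # Collect context per sub-question, routing by modality.
    context = []
    for sub in decompose(question):
        if is_visual(sub):
            context.append(captioner(image, sub))  # verbalize the image
        else:
            context.append(llm(sub))  # LLM as a general knowledge source
    # Final few-shot answer over the collected context (cf. Figure 4).
    return llm(f"Context: {' '.join(context)}\nQuestion: {question}\nAnswer:")
```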

Paper Structure

This paper contains 14 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: An example from the A-OKVQA dataset (green box). Given an image and a question, we must find the final answer by extracting and integrating visual and external knowledge (as shown in the blue box).
  • Figure 2: An example from the A-OKVQA dataset with automatically generated captions.
  • Figure 3: The diagram of our model for question decomposition, type checking, information extraction, and final context collection.
  • Figure 4: General architecture of using LLMs for the VQA model. The context extracted by the architecture in Fig. 3 will be used in this few-shot LLM model to find the final answer.
  • Figure 5: An example from the A-OKVQA dataset with the image taken from the COCO dataset.
  • ...and 1 more figure