Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions

Ziyue Wang, Chi Chen, Peng Li, Yang Liu

TL;DR

This work tackles the information gap that arises when images are converted to text for LLM-based visual question answering. It lets the LLM proactively query a Vision-Language Model (VLM) to uncover image details that the textual description misses. The authors introduce a three-module framework (inquiry, refinement, and answering): the LLM generates additional questions about the image and obtains answers from the VLM, a refinement module filters and consolidates the useful information, and the answering module prompts the LLM to produce the final answer from the augmented image information. Results on OK-VQA and A-OKVQA show consistent improvements over strong baselines, with an average gain of 2.15% on OK-VQA and consistent gains across different LLMs. The results indicate that interactive information gathering followed by refinement improves vision-language reasoning, and the approach may apply to vision-language tasks beyond VQA.

Abstract

Large Language Models (LLMs) demonstrate impressive reasoning ability and retention of world knowledge not only in natural language tasks, but also in some vision-language tasks such as open-domain knowledge-based visual question answering (OK-VQA). As images are invisible to LLMs, researchers convert images to text to involve LLMs in the visual question reasoning procedure. This leads to discrepancies between images and the textual representations presented to LLMs, which in turn impede final reasoning performance. To fill the information gap and better leverage the reasoning capability, we design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image, along with filters for refining the generated information. We validate our idea on OK-VQA and A-OKVQA. Our method consistently boosts the performance of baseline methods, with an average gain of 2.15% on OK-VQA, and achieves consistent improvements across different LLMs.

Paper Structure (32 sections, 10 equations, 3 figures, 11 tables, 1 algorithm)

Figures (3)

  • Figure 1: An example of our framework (B) compared to baselines (A). In the two methods in (A), the caption models do not provide precise information about "what is being stepped over", resulting in hallucinated answers. Our method (B) empowers the LLM to actively seek and acquire missing information by querying the VLM.
  • Figure 2: Our proposed framework consists of three modules. First, in the inquiry module, we prompt the LLM to generate new questions for the missing image information required to answer the original question, and obtain answers from a VLM. Then, a refinement module is adopted to summarize the questions and answers, filtering and extracting useful information from them. Finally, in the answering module, the LLM is prompted to predict the final answer with the augmented image information.
  • Figure 3: Cases compared to Prophet and PromptCap without applying our method. The frames titled "PromptCap"/"Prophet" depict the results given by these two baselines in our reproduced version. The information leading to incorrect answers is marked in red.
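
The three-module flow described for Figure 2 (inquiry, refinement, answering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`inquire`, `refine`, `answer`, `run_pipeline`, the `LLM`/`VLM`/`Scorer` callable types, the prompt wording, and the toy stubs) is hypothetical, and the refinement step is reduced to a simple score threshold standing in for the paper's summarization and filtering.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces, introduced only for this sketch.
LLM = Callable[[str], str]            # prompt -> text completion
VLM = Callable[[str, str], str]       # (image_id, question) -> answer
Scorer = Callable[[str, str], float]  # (question, answer) -> usefulness score


@dataclass
class QAPair:
    question: str
    answer: str


def inquire(llm: LLM, vlm: VLM, image_id: str, caption: str, question: str) -> List[QAPair]:
    """Inquiry: the LLM proposes follow-up questions and the VLM answers them."""
    prompt = (
        f"Caption: {caption}\nQuestion: {question}\n"
        "List further questions about the image, one per line."
    )
    follow_ups = [q.strip() for q in llm(prompt).splitlines() if q.strip()]
    return [QAPair(q, vlm(image_id, q)) for q in follow_ups]


def refine(pairs: List[QAPair], scorer: Scorer, threshold: float = 0.5) -> List[str]:
    """Refinement (simplified assumption): keep only pairs scored as useful."""
    return [
        f"{p.question} {p.answer}"
        for p in pairs
        if scorer(p.question, p.answer) >= threshold
    ]


def answer(llm: LLM, caption: str, hints: List[str], question: str) -> str:
    """Answering: the LLM predicts from the caption plus the refined hints."""
    context = "\n".join([caption, *hints])
    return llm(f"Context:\n{context}\nQuestion: {question}\nAnswer:").strip()


def run_pipeline(llm: LLM, vlm: VLM, scorer: Scorer,
                 image_id: str, caption: str, question: str) -> str:
    pairs = inquire(llm, vlm, image_id, caption, question)
    hints = refine(pairs, scorer)
    return answer(llm, caption, hints, question)


# Toy stand-ins so the sketch runs without any real model.
def toy_llm(prompt: str) -> str:
    if "further questions" in prompt:
        return "What is the person stepping over?\nWhat color is the sky?"
    return "puddle" if "puddle" in prompt else "unknown"


def toy_vlm(image_id: str, question: str) -> str:
    return "a puddle" if "stepping over" in question else "not sure"


def toy_scorer(question: str, answer_text: str) -> float:
    return 0.0 if answer_text == "not sure" else 1.0


if __name__ == "__main__":
    print(run_pipeline(toy_llm, toy_vlm, toy_scorer,
                       "img-1", "A person walking on a street.",
                       "What is being stepped over?"))
```

With the toy stubs, the uninformative VLM reply is filtered out during refinement, and only the detail about what is being stepped over reaches the answering prompt. Passing the models as plain callables keeps the sketch independent of any particular LLM or VLM API.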