Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions
Ziyue Wang, Chi Chen, Peng Li, Yang Liu
TL;DR
This work tackles the information gap that arises when images are converted to text for LLM-based visual question answering by enabling LLMs to proactively query a Vision-Language Model to uncover missing image details. The authors introduce a framework with three modules (inquiry, refinement, and answering): the LLM generates additional questions about the image, the refinement module filters and consolidates the useful information, and the answering module produces the final answer from the augmented image information. Empirical results on OK-VQA and A-OKVQA show consistent improvements over strong baselines, with an average gain of 2.15% on OK-VQA and robust performance across multiple LLMs and settings. The approach demonstrates the value of interactive information gathering and refinement for vision-language reasoning, with potential applicability to a broad range of VL tasks beyond VQA.
Abstract
Large Language Models (LLMs) demonstrate impressive reasoning ability and retention of world knowledge not only in natural language tasks, but also in some vision-language tasks such as open-domain knowledge-based visual question answering (OK-VQA). Because LLMs cannot see images, researchers convert images to text to involve LLMs in the visual question reasoning procedure. This leads to discrepancies between images and the textual representations presented to LLMs, which in turn impede final reasoning performance. To fill the information gap and better leverage the reasoning capability of LLMs, we design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image, along with filters for refining the generated information. We validate our idea on OK-VQA and A-OKVQA. Our method consistently boosts the performance of baseline methods, by an average gain of 2.15% on OK-VQA, and achieves consistent improvements across different LLMs.
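
The inquiry, refinement, and answering flow described above can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `answer_with_inquiry` and the four callables (`ask_llm`, `vlm_answer`, `keep`, `reason`) are hypothetical placeholders for the LLM question generator, the Vision-Language Model, the refinement filter, and the final reasoning step. The way the context string is assembled is also an assumption.

```python
from typing import Callable, List, Tuple


def answer_with_inquiry(
    caption: str,
    question: str,
    ask_llm: Callable[[str, str], List[str]],   # hypothetical: LLM proposes extra questions
    vlm_answer: Callable[[str], str],           # hypothetical: VLM answers a question about the image
    keep: Callable[[str, str], bool],           # hypothetical: refinement filter over (question, answer)
    reason: Callable[[str, str], str],          # hypothetical: LLM produces the final answer
) -> str:
    # Inquiry: the LLM asks for details that the caption alone does not provide.
    sub_questions = ask_llm(caption, question)
    qa_pairs: List[Tuple[str, str]] = [(q, vlm_answer(q)) for q in sub_questions]

    # Refinement: drop question-answer pairs judged unhelpful or unreliable.
    kept = [(q, a) for q, a in qa_pairs if keep(q, a)]

    # Answering: reason over the caption augmented with the kept details.
    context = " ".join([caption] + [f"{q} {a}." for q, a in kept])
    return reason(context, question)
```

Passing the four components as callables keeps the sketch independent of any particular LLM or VLM, in line with the claim that the gains hold across different LLMs.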
