Right this way: Can VLMs Guide Us to See More to Answer Questions?

Li Liu, Diji Yang, Sijia Zhong, Kalyana Suma Sree Tholeti, Lei Ding, Yi Zhang, Leilani H. Gilpin

TL;DR

This study presents an automated framework that generates synthetic training data by simulating "where to know" scenarios, demonstrating the potential to narrow the gap between information assessment and acquisition in VLMs and bring their performance closer to that of humans.

Abstract

In question-answering scenarios, humans can assess whether the available information is sufficient and seek additional information if necessary, rather than providing a forced answer. In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information. To investigate this gap, we identify a critical and challenging task in the Visual Question Answering (VQA) scenario: can VLMs indicate how to adjust an image when the visual information is insufficient to answer a question? This capability is especially valuable for assisting visually impaired individuals who often need guidance to capture images correctly. To evaluate this capability of current VLMs, we introduce a human-labeled dataset as a benchmark for this task. Additionally, we present an automated framework that generates synthetic training data by simulating "where to know" scenarios. Our empirical results show significant performance improvements in mainstream VLMs when fine-tuned with this synthetic data. This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.

Paper Structure

This paper contains 21 sections, 1 equation, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Examples of the Directional Guidance task. The model uses self-knowledge to distinguish between known and unknown information and provides guidance on where to find more information.
  • Figure 2: The training set generation framework.
  • Figure 3: The distribution of four directions in our benchmark dataset (a) and examples (b-e). The upper caption is the Directional Guidance label and the lower caption is the original question.
  • Figure 4: Heatmaps of the model predictions. (a1)-(a4) show the baseline performance in the zero-shot setting, and (b1)-(b4) show the performance of the fine-tuned models. 'O' denotes the class "leave it unchanged", and 'X' denotes the class "none of the other options".
  • Figure 5: A screenshot of the annotation work.
  • ...and 5 more figures
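
The prediction heatmaps described in the Figure 4 caption can be read as confusion matrices over the Directional Guidance classes. The sketch below shows one way such a matrix might be tallied from gold and predicted labels. It is illustrative only and is not the authors' evaluation code: the direction names ("left", "right", "up", "down") are an assumption, since the captions mention four directions without naming them, and the helper names `confusion_matrix` and `accuracy` are hypothetical.

```python
from collections import Counter

# Assumed label set: four directions (names are an assumption) plus the two
# extra classes from the Figure 4 caption:
# 'O' = leave it unchanged, 'X' = none of the other options.
LABELS = ["left", "right", "up", "down", "O", "X"]


def confusion_matrix(gold, pred, labels=LABELS):
    """Return a matrix where entry [i][j] counts gold label i predicted as label j."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def accuracy(matrix):
    """Fraction of examples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy labels for illustration; these are not results from the paper.
    gold = ["left", "left", "O", "X"]
    pred = ["left", "O", "O", "X"]
    matrix = confusion_matrix(gold, pred)
    for label, row in zip(LABELS, matrix):
        print(f"{label:>5}: {row}")
    print("accuracy:", accuracy(matrix))
```

Row-normalizing such a matrix and plotting it would give a heatmap of the same kind, with off-diagonal mass showing which classes a model confuses.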