VoQA: Visual-only Question Answering

Jianing An, Luyang Jiang, Jie Luo, Wenjun Wu, Lei Huang

TL;DR

VoQA reframes visual understanding by embedding the question directly in a single image, requiring fully vision-based reasoning. A large VoQA dataset (≈3.35M training samples) and an evaluation benchmark (≈134k samples) are built by rendering questions into images from LLaVA and existing VQA datasets. Evaluations across open- and closed-source LVLMs show a clear gap between VoQA and traditional VQA, with question alignment emerging as a key bottleneck. The authors propose question-alignment supervised fine-tuning, notably QRA-SFT, which guides models to first parse the embedded question and then reason over the visual content, achieving robust VoQA performance and improved cross-task generalization to VQA. Code and data are publicly available, establishing VoQA as a practical platform for unified visual reasoning research.

Abstract

Visual understanding requires interpreting both natural scenes and the textual information that appears within them, motivating tasks such as Visual Question Answering (VQA). However, current VQA benchmarks overlook scenarios with visually embedded questions, whereas advanced agents should be able to see the question without separate text input, as humans do. We introduce Visual-only Question Answering (VoQA), where both the scene and the question appear within a single image, requiring models to perceive and reason purely through vision. This setting supports more realistic visual understanding and interaction in scenarios where questions or instructions are embedded directly in the visual scene. Evaluations under pure visual-only zero-shot, prompt-guided, and OCR-assisted settings show that current models exhibit a clear performance drop compared to traditional VQA. To address this, we investigate question-alignment fine-tuning strategies designed to guide models toward interpreting the visual question prior to reasoning. Leveraging the VoQA dataset together with these strategies yields robust vision-only reasoning while preserving cross-task generalization to traditional VQA, reflecting the complementary visual and textual reasoning capabilities fostered through VoQA training. The code and data are publicly available.
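The idea of interpreting the visual question before reasoning can be illustrated with a minimal sketch of how a question-alignment fine-tuning target might be assembled. This is an assumption-laden illustration, not the authors' template: the `VoQASample` class, the `Question:`/`Answer:` tags, and the helper names `build_target` and `split_response` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VoQASample:
    image_path: str  # image with the question already rendered into it
    question: str    # ground-truth text of the embedded question
    answer: str      # ground-truth answer


# Hypothetical delimiters; the actual target format is not specified here.
QUESTION_TAG = "Question:"
ANSWER_TAG = "Answer:"


def build_target(sample, align_question=True):
    """Build the text the model is trained to emit for one sample.

    With align_question=True the target restates the embedded question
    before the answer; with False it is the plain answer-only baseline.
    """
    if align_question:
        return f"{QUESTION_TAG} {sample.question}\n{ANSWER_TAG} {sample.answer}"
    return sample.answer


def split_response(response):
    """Split a model response into (recited_question, answer).

    Returns (None, answer) when the response has no answer tag, i.e. the
    model answered directly without restating the question.
    """
    if ANSWER_TAG not in response:
        return None, response.strip()
    head, _, tail = response.partition(ANSWER_TAG)
    question = head.replace(QUESTION_TAG, "", 1).strip()
    return question, tail.strip()
```

Under this sketch, the recited question returned by `split_response` is what a question-alignment check would compare against the ground-truth question, while the second element is scored as the answer.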

Paper Structure

This paper contains 75 sections, 1 equation, 13 figures, 19 tables.

Figures (13)

  • Figure 1: Comparison between (a) the traditional VQA task and (b) the Visual-only Question Answering (VoQA) task. Traditional VQA provides an image and a textual question as separate inputs, whereas VoQA embeds the question directly within the image, requiring reasoning purely through visual perception.
  • Figure 2: Two examples of watermark rendering with different text colors.
  • Figure 3: Average Accuracy (%) of all models on the VoQA benchmark under various zero-shot settings (pure visual-only, prompt-guided, and OCR-assisted) across all datasets, compared with traditional VQA benchmarks. The models correspond to those introduced in the evaluation setup section. Across all models and zero-shot settings, performance on VoQA is noticeably lower than on traditional VQA, highlighting the challenge of reasoning over visually embedded questions. Detailed results on each dataset are provided in the zero-shot results appendix.
  • Figure 4: Average Question Alignment Accuracy (QAA) and Answer Accuracy (ACC) for all models across four VoQA sub-tasks (all except VQAv2) under the two workflow prompt settings. Correct Answers and Incorrect Answers indicate averages computed over correctly and incorrectly answered samples, respectively. The models correspond to those introduced in the evaluation setup section. For each model, the results shown correspond to the workflow setting (short or long) that yields the higher average ACC. Complete QAA and ACC results are provided in the zero-shot results appendix.
  • Figure 5: Comparison of four supervised fine-tuning strategies. The first two represent baseline fine-tuning under the VQA and VoQA settings, respectively. The bottom two are our proposed VoQA-specific methods that first align the visually embedded question before generating the answer.
  • ...and 8 more figures
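
The split described in the Figure 4 caption, where QAA is averaged separately over correctly and incorrectly answered samples, can be sketched as follows. This is only an illustration under stated assumptions: each sample is taken to carry a per-sample QAA score in [0, 1] and a boolean answer-correctness flag, and the function name `summarize` is hypothetical. How QAA itself is scored per sample is not specified here.

```python
from statistics import mean


def summarize(records):
    """Aggregate (qaa_score, is_correct) pairs into percentages.

    qaa_score is assumed to be a per-sample value in [0, 1];
    is_correct says whether the final answer was right.
    Groups with no samples are reported as None.
    """
    records = list(records)
    qaa_correct = [qaa for qaa, ok in records if ok]
    qaa_incorrect = [qaa for qaa, ok in records if not ok]
    return {
        # Answer accuracy over all samples
        "acc": 100.0 * len(qaa_correct) / len(records) if records else None,
        # Mean QAA among correctly answered samples
        "qaa_correct": 100.0 * mean(qaa_correct) if qaa_correct else None,
        # Mean QAA among incorrectly answered samples
        "qaa_incorrect": 100.0 * mean(qaa_incorrect) if qaa_incorrect else None,
    }
```

A large gap between `qaa_correct` and `qaa_incorrect` would be consistent with question alignment acting as a bottleneck, which is the reading the caption invites.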