VoQA: Visual-only Question Answering
Jianing An, Luyang Jiang, Jie Luo, Wenjun Wu, Lei Huang
TL;DR
VoQA reframes visual understanding by embedding the question directly in a single image, requiring fully vision-based reasoning. A large VoQA dataset (≈3.35M training samples) and an evaluation benchmark (≈134k samples) are built by rendering questions into images, drawing on LLaVA and existing VQA datasets. Evaluations across open- and closed-source LVLMs show a clear gap between VoQA and traditional VQA, with question alignment emerging as a key bottleneck. The authors propose question-alignment supervised fine-tuning, notably QRA-SFT, which guides models to first parse the embedded question and then reason over the visual content. This achieves robust VoQA performance and improves cross-task generalization to VQA. Code and data are publicly available, establishing VoQA as a practical platform for unified visual reasoning research.
Abstract
Visual understanding requires interpreting both natural scenes and the textual information that appears within them, motivating tasks such as Visual Question Answering (VQA). However, current VQA benchmarks overlook scenarios in which the question is visually embedded, whereas advanced agents should be able to see the question without separate text input, as humans do. We introduce Visual-only Question Answering (VoQA), where both the scene and the question appear within a single image, requiring models to perceive and reason purely through vision. This setting supports more realistic visual understanding and interaction in scenarios where questions or instructions are embedded directly in the visual scene. Evaluations under pure visual-only zero-shot, prompt-guided, and OCR-assisted settings show that current models exhibit a clear performance drop compared to traditional VQA. To address this, we investigate question-alignment fine-tuning strategies designed to guide models toward interpreting the visual question prior to reasoning. Leveraging the VoQA dataset together with these strategies yields robust vision-only reasoning while preserving cross-task generalization to traditional VQA, reflecting the complementary visual and textual reasoning capabilities fostered through VoQA training. The code and data are publicly available.
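As an illustration of what "interpreting the visual question prior to reasoning" could look like at the data level, the following is a minimal sketch of a question-then-answer training target. It is not the authors' implementation: the `VoQASample` record, the `Question:` prefix, the `SEP` separator, and both helper functions are hypothetical names chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class VoQASample:
    image_path: str  # image with the question already rendered into it
    question: str    # ground-truth text of the embedded question
    answer: str      # ground-truth answer


# Hypothetical markers; the actual format used in the paper may differ.
PREFIX = "Question: "
SEP = "\nAnswer: "


def build_target(sample: VoQASample) -> str:
    """Build a target string that states the question first, then the answer."""
    return f"{PREFIX}{sample.question.strip()}{SEP}{sample.answer.strip()}"


def split_response(response: str) -> tuple:
    """Split a model response into (question, answer).

    If the separator is missing, the whole response is treated as the answer.
    """
    head, found, tail = response.partition(SEP)
    if not found:
        return "", response.strip()
    if head.startswith(PREFIX):
        head = head[len(PREFIX):]
    return head.strip(), tail.strip()
```

With a target of this shape, the model is supervised to restate the embedded question before answering, and `split_response` recovers the answer part for scoring against the VQA ground truth.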
