VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang
TL;DR
VisOnlyQA targets the underexplored problem of geometric perception in LVLMs, isolating the perception of basic geometric information from higher-order reasoning. The dataset comprises 12 tasks that ask directly about geometric information in geometric shapes, chemical structures, charts, and 3D shapes, with Real and Synthetic splits to probe pure perception. Experiments across 23 LVLMs, varied prompts, and fine-tuning settings reveal a substantial gap between model and human performance. Larger language models help but do not close this gap, which suggests that the way LVLMs process visual encoder outputs is a bottleneck. The data, code, and model responses are publicly available, giving a reproducible benchmark and guidance on where future LVLM work should focus, including data design and the role of the LLM in geometric perception.
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable performance in various vision-language tasks. However, it is still unclear how accurately LVLMs can perceive visual information in images. In particular, the capability of LVLMs to perceive geometric information, such as shape, angle, and size, remains insufficiently analyzed, although the perception of these properties is crucial for tasks that require a detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and reveal that LVLMs often cannot accurately perceive basic geometric information in images, while human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art LVLMs struggle with basic geometric perception. The 23 LVLMs we evaluate, including GPT-4o and Gemini 2.5 Pro, perform poorly on VisOnlyQA. (ii) Additional training data does not resolve this issue. Fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks. (iii) The LLM may be the bottleneck. LVLMs using stronger LLMs exhibit better geometric perception on VisOnlyQA, even though the dataset does not require complex reasoning, suggesting that the way LVLMs process information from visual encoders is a bottleneck. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
