VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang

TL;DR

VisOnlyQA targets an underexplored area of LVLM evaluation: geometric perception, measured separately from higher-order reasoning. The dataset comprises 12 tasks that ask directly about geometric information in geometric shapes, chemical structures, charts, and 3D shapes, with Real and Synthetic evaluation splits. Experiments across 23 LVLMs, with varied prompts and fine-tuning settings, reveal a substantial gap between model and human performance. LVLMs built on stronger LLMs perceive geometry better, but the gap remains, which suggests that the way LVLMs process visual encoder outputs is a bottleneck. The data, code, and model responses are publicly available.

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable performance in various vision-language tasks. However, it is still unclear how accurately LVLMs can perceive visual information in images. In particular, the capability of LVLMs to perceive geometric information, such as shape, angle, and size, remains insufficiently analyzed, although the perception of these properties is crucial for tasks that require a detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and reveal that LVLMs often cannot accurately perceive basic geometric information in images, while human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art LVLMs struggle with basic geometric perception. 23 LVLMs we evaluate, including GPT-4o and Gemini 2.5 Pro, work poorly on VisOnlyQA. (ii) Additional training data does not resolve this issue. Fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks. (iii) LLM may be the bottleneck. LVLMs using stronger LLMs exhibit better geometric perception on VisOnlyQA, while it does not require complex reasoning, suggesting that the way LVLMs process information from visual encoders is a bottleneck. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.

Paper Structure

This paper contains 58 sections, 6 figures, 65 tables.

Figures (6)

  • Figure 1: Examples from 12 tasks in VisOnlyQA and answers from LVLMs. Figures in VisOnlyQA are from existing datasets or generated by us, and all questions are created by us. Questions in this figure are abbreviated. Refer to the examples appendix for full inputs and responses.
  • Figure 2: LVLMs perform poorly on VisOnlyQA, while human performance is nearly perfect. The results table (without chain-of-thought) provides detailed results.
  • Figure 3: Construction process of synthetic images and questions in VisOnlyQA-Eval-Synthetic and VisOnlyQA-Train. This process does not involve language models and uses precise metadata, guaranteeing the correctness of generated question-answer pairs.
  • Figure 4: Example figures and model outputs for the analysis dataset. LVLMs exhibit poor geometric perception even on very simple geometric shapes.
  • Figure 5: Error categories in chain-of-thought reasoning by LVLMs on VisOnlyQA-Eval-Real. Almost all errors are visual perception errors, verifying that our dataset evaluates the geometric perception of LVLMs independent of other capabilities. Each response can include multiple categories of errors.
  • ...and 1 more figure
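
The Figure 3 caption describes question-answer pairs derived from precise metadata rather than from a language model. The sketch below is one way such a step might look, not the authors' implementation: the function names, the question template, and the point-coordinate metadata layout are all hypothetical. It computes an angle directly from vertex coordinates, so the answer is correct by construction.

```python
import math


def angle_at(vertex, p, q):
    """Return the angle in degrees at `vertex` between the rays to p and q."""
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot gives an unsigned angle in [0, 180].
    return math.degrees(math.atan2(abs(cross), dot))


def make_angle_question(points):
    """Build a True/False question about angle ABC from point coordinates.

    `points` is a hypothetical metadata dict mapping labels to (x, y) tuples.
    The question wording is illustrative, not taken from the dataset.
    """
    angle_b = angle_at(points["B"], points["A"], points["C"])
    return {
        "question": "Is angle ABC larger than 90 degrees? Answer True or False.",
        "answer": "True" if angle_b > 90 else "False",
        "metadata": {"points": points, "angle_ABC_deg": angle_b},
    }


if __name__ == "__main__":
    example = make_angle_question({"A": (1, 0), "B": (0, 0), "C": (-1, 1)})
    print(example["question"], example["answer"])
```

Because the label comes from the same coordinates that would be used to draw the figure, no annotation or model judgment is involved in this sketch.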