Understanding Depth and Height Perception in Large Visual-Language Models

Shehreen Azad; Yash Jain; Rishit Garg; Yogesh S Rawat; Vibhav Vineet

TL;DR

This work tackles the gap in geometric understanding of Vision Language Models by introducing GeoMeter, a synthetic benchmark suite with GeoMeter-2D and GeoMeter-3D designed to probe depth and height perception. By evaluating 18 VLMs on depth/height VQA tasks constructed from controlled 2D and 3D scenes, the study reveals that while the models capture basic geometric properties such as shape and size, depth and especially height reasoning remain challenging and susceptible to biases. The findings suggest that improvements require architectural advances and targeted training beyond prompting, with practical implications for real-world perception tasks in areas like navigation and assistive tech. The work establishes a clear benchmark and baseline for future efforts to enhance geometric reasoning in multimodal models.

Abstract

Geometric understanding - including depth and height perception - is fundamental to intelligence and crucial for navigating our environment. Despite the impressive capabilities of large Vision Language Models (VLMs), it remains unclear how well they possess the geometric understanding required for practical applications in visual perception. In this work, we focus on evaluating the geometric understanding of these models, specifically targeting their ability to perceive the depth and height of objects in an image. To address this, we introduce GeoMeter, a suite of benchmark datasets - encompassing 2D and 3D scenarios - to rigorously evaluate these aspects. By benchmarking 18 state-of-the-art VLMs, we find that although they excel in perceiving basic geometric properties like shape and size, they consistently struggle with depth and height perception. Our analysis reveals that these challenges stem from shortcomings in their depth and height reasoning capabilities and inherent biases. This study aims to pave the way for developing VLMs with enhanced geometric understanding by emphasizing depth and height perception as critical components necessary for real-world applications.

Paper Structure (22 sections, 15 figures, 6 tables)

Introduction
Related Works
Benchmark
Datasets
Image Generation
Question Generation
Experimental Setup
Vision Language Models
Human Evaluators
Evaluation Metrics
Implementation Details
Results
Analysis and Discussion
Model Behavior Analysis
Model Bias Analysis
...and 7 more sections

Figures (15)

Figure 1: Depth and height perception capability of existing VLMs. Here, we show failure cases of GPT-4V in understanding depth and height on GeoMeter, our proposed suite of benchmark datasets.
Figure 2: Samples from the proposed suite of benchmark datasets. Each sample is shown with random query attributes: color and numeric label for GeoMeter-2D, and color and material for GeoMeter-3D.
Figure 3: Sample image-text pair from the datasets. The prompt template is the basic template for each image-text pair in our datasets, and the prompt example is the actual prompt for the image shown. Either an MCQ or a True/False question is appended to the prompt example.
Figure 4: Depth and height perception performance on the proposed GeoMeter-2D and GeoMeter-3D datasets for MCQ and True/False (T/F) questions. D and H denote depth and height performance, respectively. For example, 2D(D) MCQ and 2D(H) MCQ correspond to GeoMeter-2D depth and height performance on MCQ questions, respectively. The Y-axis denotes the average performance across shape and query attributes, and the X-axis denotes the evaluated models. Darker color denotes better performance.
Figure 5: Model behavior on basic understanding of shapes and size on our GeoMeter-2D-Basic dataset (samples on the left). Performance of selected models on this dataset is shown on the right. Here, LU, SI, SC and SR denote line understanding, shape identification, shape counting and spatial reasoning, respectively. The Y-axis denotes accuracy for each category, and the X-axis denotes the evaluated models. Darker color denotes better performance.
...and 10 more figures
