Understanding Depth and Height Perception in Large Visual-Language Models
Shehreen Azad, Yash Jain, Rishit Garg, Yogesh S Rawat, Vibhav Vineet
TL;DR
This work addresses the gap in geometric understanding of Vision Language Models (VLMs) by introducing GeoMeter, a synthetic benchmark suite comprising GeoMeter-2D and GeoMeter-3D, designed to probe depth and height perception. Evaluating 18 VLMs on depth and height VQA tasks constructed from controlled 2D and 3D scenes, the study shows that although the models capture basic geometry, depth reasoning and especially height reasoning remain challenging and susceptible to biases. The findings suggest that improvement requires architectural advances and targeted training rather than prompting alone, with practical implications for real-world perception tasks in areas such as navigation and assistive technology. The work establishes a benchmark and baseline for future efforts to improve geometric reasoning in multimodal models.
Abstract
Geometric understanding, including depth and height perception, is fundamental to intelligence and crucial for navigating our environment. Despite the impressive capabilities of large Vision Language Models (VLMs), it remains unclear whether they possess the geometric understanding required for practical applications in visual perception. In this work, we evaluate the geometric understanding of these models, specifically their ability to perceive the depth and height of objects in an image. To this end, we introduce GeoMeter, a suite of benchmark datasets covering 2D and 3D scenarios, to rigorously evaluate these aspects. Benchmarking 18 state-of-the-art VLMs, we find that although they excel at perceiving basic geometric properties such as shape and size, they consistently struggle with depth and height perception. Our analysis reveals that these challenges stem from shortcomings in their depth and height reasoning capabilities and from inherent biases. This study aims to pave the way for developing VLMs with enhanced geometric understanding by emphasizing depth and height perception as critical components for real-world applications.
