Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference

Amit Bendkhale

TL;DR

The paper presents Tri-Bench, a compact, controlled benchmark that isolates relative geometric reasoning in Vision-Language Models under camera tilt and object interference. It reveals a pervasive 2D-image-plane bias, poor handling of minority triangle shapes, and measurable degradation under tilt, despite a guardrail prompt aimed at invoking homography-based 3D reasoning. The work highlights critical gaps in verifiability for deployment-critical tasks and provides a reproducible diagnostic to guide future improvements in robust, trustworthy spatial reasoning for agentic AI.

Abstract

Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
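To illustrate why a known square border makes the tilted-view tasks answerable in principle, the following is a minimal sketch, not the paper's implementation. It assumes the four border corners and the three triangle vertices have already been located in the image. It estimates a homography from those corners with the direct linear transform and compares side lengths before and after rectification. All pixel coordinates, the unit-square convention, and the function names are hypothetical choices made for this example.

```python
import numpy as np


def homography_from_points(src, dst):
    """Estimate the 3x3 homography H with dst ~ H @ src (direct linear transform).

    src and dst are sequences of four or more (x, y) pairs in corresponding order.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # assumes H[2, 2] != 0, which holds for this example


def apply_homography(H, pts):
    """Map an (N, 2) array of points through H, including the perspective divide."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def side_lengths(tri):
    """Return the lengths |AB|, |BC|, |CA| of a triangle given as a (3, 2) array."""
    tri = np.asarray(tri, dtype=float)
    return np.linalg.norm(tri - np.roll(tri, -1, axis=0), axis=1)


# Hypothetical setup: the square border is the unit square in the plane,
# and a tilted camera sees it as a trapezoid (made-up pixel coordinates).
square_plane = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_image = [(100, 400), (500, 400), (420, 150), (180, 150)]

# An equilateral triangle with side 0.6 drawn inside the square.
tri_plane = np.array([(0.2, 0.2), (0.8, 0.2), (0.5, 0.2 + 0.6 * np.sqrt(3) / 2)])

# Simulate the tilted capture, then undo it using only the border corners.
H_plane_to_image = homography_from_points(square_plane, square_image)
tri_image = apply_homography(H_plane_to_image, tri_plane)

H_image_to_plane = homography_from_points(square_image, square_plane)
tri_rectified = apply_homography(H_image_to_plane, tri_image)

image_sides = side_lengths(tri_image)      # distorted by perspective
plane_sides = side_lengths(tri_rectified)  # recovered in the square's frame

print("side lengths in the image plane:", image_sides)
print("side lengths after rectification:", plane_sides)
```

In this synthetic case the image-plane side lengths are visibly unequal, while the rectified lengths are all equal again. Answering from image-plane measurements therefore misclassifies the shape, which is the kind of 2D bias the benchmark is designed to expose.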

Paper Structure

This paper contains 20 sections, 3 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Four capture conditions for triangle 037. Top row displays the four captures P0, P1, T0, and T1; bottom row displays the corresponding marked images. The triangle is obtuse in P0_marked but appears right-angled in T0_marked.
  • Figure 2: Average accuracy across all models for each task (Q1–Q6) under the four capture conditions P0, P1, T0, and T1. Tilted views (T0/T1) are consistently less accurate than planar views (P0/P1), while the presence of an object (P1/T1 vs. P0/T0) has only a minor effect.
  • Figure 3: The ten everyday objects used in Tri-Bench.