LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks

Fei Kong

TL;DR

LRR-Bench is a fully synthetic spatial reasoning benchmark for Vision-Language Models (VLMs). It covers absolute positioning and 3D spatial understanding, including rotation and movement in 2D/3D settings. The authors build a low-cost generation pipeline using diffusion models and Minecraft, with filtering via GroundingDINO and SAM, and evaluate more than 20 VLMs under direct and reasoning-augmented prompts. Humans outperform the models by a wide margin: VLMs achieve near-random performance on most 3D tasks and only modest gains on the simplest absolute-position tasks, which points to persistent gaps in spatial understanding. The work argues that synthetic data is viable for rigorous spatial reasoning evaluation, and suggests that model scaling, 3D fine-tuning, and reasoning prompts do not reliably close the gap, pointing to new directions in spatial cognition research.

Abstract

Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. We categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Specifically, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, even the best-performing VLMs achieve near-zero scores on multiple tasks. The dataset and code are available at https://github.com/kong13661/LRR-Bench.

Paper Structure

This paper contains 15 sections, 1 equation, 11 figures, 3 tables.

Figures (11)

  • Figure 1: This diagram illustrates our categorization of the spatial understanding problem and the overall pipeline. The blue section represents 3D spatial understanding, while the yellow section represents absolute position understanding. In 3D spatial understanding, we decompose spatial transformation into rotation and movement, applying them separately to the camera and the object. In absolute position understanding, we detect the object's absolute position within the image, such as the center, top-left, and so on. For the tasks related to absolute position and depth, the samples are generated using a diffusion model. These samples are then filtered by GroundingDINO and processed by various models. Samples for the other tasks are generated by applying movement and rotation to the camera and object within Minecraft.
  • Figure 2: This diagram illustrates the process of generating samples using the diffusion model. The first column displays the prompt. The generated samples are then fed to GroundingDINO, which outputs the bounding box and a confidence score. The confidence score is categorized into three classes: existing classes, non-existing classes, and uncertain classes. The bounding boxes and confidence scores are used to filter the samples. The filtered samples are subsequently fed to the next stage, which varies depending on the specific task.
  • Figure 3: Please answer if the image has a book (suitcase) at the bottom-left (bottom-left) of the image. Please answer Yes or No.
  • Figure 4: Follow the subplot order A, B, C to check whether the position of the handbag (broccoli) in each subplot matches top-left (top-left), bottom-left (bottom-right), and top-left (bottom-left), respectively. Please answer Yes if all positions are right, otherwise No.
  • Figure 5: The background of the sequence is the same, with a different camera. Please answer if the camera's rotation direction of the image sequence is the same following A, B, C, D. The answer is either Yes or No.
  • ...and 6 more figures
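
The filtering step described in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` type, the two thresholds, the 3x3 grid used to name positions, and the function names are all hypothetical, and the detector output is assumed to have already been reduced to one bounding box and one confidence score per prompt. The sketch only shows the idea of sorting a confidence score into the three classes (existing, non-existing, uncertain) and keeping a sample when a confidently detected object sits at the expected absolute position.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Detection:
    """Hypothetical container for one detector result (not a GroundingDINO type)."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    score: float                            # detector confidence in [0, 1]


# Assumed thresholds, chosen only for illustration.
EXIST_THRESHOLD = 0.6
ABSENT_THRESHOLD = 0.2


def confidence_class(score: float) -> str:
    """Sort a confidence score into one of the three classes named in Figure 2."""
    if score >= EXIST_THRESHOLD:
        return "existing"
    if score <= ABSENT_THRESHOLD:
        return "non-existing"
    return "uncertain"


def grid_position(box: Tuple[float, float, float, float],
                  width: int, height: int) -> str:
    """Name the cell of an assumed 3x3 grid that contains the box center."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = min(int(3 * cx / width), 2)   # 0 = left, 1 = center, 2 = right
    row = min(int(3 * cy / height), 2)  # 0 = top, 1 = center, 2 = bottom
    rows = ["top", "center", "bottom"]
    cols = ["left", "center", "right"]
    if row == 1 and col == 1:
        return "center"
    if row == 1:
        return cols[col]
    if col == 1:
        return rows[row]
    return f"{rows[row]}-{cols[col]}"


def keep_sample(det: Optional[Detection], expected: str,
                width: int, height: int) -> bool:
    """Keep a generated image only if the object is confidently detected
    and its box center falls in the expected grid cell."""
    if det is None or confidence_class(det.score) != "existing":
        return False
    return grid_position(det.box, width, height) == expected
```

For example, a 512x512 image with a confident detection whose box center is near the lower-left corner would be kept for a "bottom-left" prompt, while the same box with an uncertain score would be discarded. How the actual pipeline sets its thresholds and defines position regions is not stated in this section.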