GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs

Navid Rajabi, Jana Kosecka

TL;DR

This paper introduces GSR-Bench, an extended benchmark for grounded spatial reasoning that augments the What'sUp dataset with depth, bounding boxes, and masks to enable fine-grained evaluation of spatial understanding in both VLMs and multimodal LLMs (MLLMs). It compares 18 VLMs and 9 MLLMs using depth-aware and grounding-focused evaluation, highlights biases introduced by the prompting strategy, and shows that depth-augmented prompts can improve performance in depth-sensitive cases. A CircularEval-based, template-driven prompting approach yields more reliable assessments than traditional multiple-choice prompts. Results indicate that larger, higher-resolution MLLMs achieve substantial gains over prior baselines and that grounding accuracy is causally related to reasoning success. The work also provides a cost-efficient auto-annotation pipeline for grounding, discusses limitations, reports a significant performance lead for LLaVA-based models over the prior state of the art, and outlines future directions toward closing the remaining gap to human performance.
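
The CircularEval idea mentioned above can be illustrated with a short sketch. CircularEval, as introduced with MMBench, presents the same question once per circular shift of the answer options and counts it as correct only if the model is right every time. The snippet below is an illustration under stated assumptions, not the paper's implementation: `ask_model` is a hypothetical callable that receives the options in presentation order and returns the text of the chosen option, and the option strings are purely illustrative.

```python
from typing import Callable, List, Sequence


def circular_shifts(options: Sequence[str]) -> List[List[str]]:
    """Return every circular rotation of the option list (n rotations for n options)."""
    n = len(options)
    return [list(options[k:]) + list(options[:k]) for k in range(n)]


def circular_eval(
    options: Sequence[str],
    correct: str,
    ask_model: Callable[[List[str]], str],  # hypothetical model wrapper (assumption)
) -> bool:
    """Count the question as correct only if the model is right under every rotation."""
    return all(ask_model(shifted) == correct for shifted in circular_shifts(options))


# Illustrative option set (assumption, not taken from the benchmark files).
options = ["on", "under", "left of", "right of"]

# A position-biased "model" that always picks the first listed option.
always_first = lambda opts: opts[0]

# An oracle "model" that always returns the right relation.
oracle = lambda opts: "under"

biased_passes = circular_eval(options, "under", always_first)
oracle_passes = circular_eval(options, "under", oracle)
```

A model that merely favors one answer position passes a single multiple-choice pass by chance but fails under this stricter criterion, which is why the protocol is considered more reliable against choice-order bias.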

Abstract

The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning. This skill rests on the ability to recognize and localize objects of interest and determine their spatial relation. Early vision and language models (VLMs) have been shown to struggle to recognize spatial relations. We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding that highlights the strengths and weaknesses of 27 different models. In addition to the VLMs evaluated in What'sUp, our extensive evaluation encompasses 3 classes of Multimodal LLMs (MLLMs) that vary in their parameter sizes (ranging from 7B to 110B), training/instruction-tuning methods, and visual resolution to benchmark their performances and scrutinize the scaling laws in this task.

Paper Structure (17 sections, 5 figures, 3 tables)

This paper contains 17 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: LLaMA-3-LLaVA-NeXT-8B achieves an overall accuracy of 86.1% on the What'sUp benchmark, compared to 60.4% for XVLM-COCO. It offers the best trade-off between accuracy and parameter count, performing only 1.1% below LLaVA-NeXT-34B, which has $4.25\times$ as many parameters.
  • Figure 2: Our pipeline overview for spatial relationship understanding prompting, shown in the top two figures, and our depth-augmented prompting, shown in the bottom figure.
  • Figure 3: Sensitivity of the models to different permutations of the choice order in the multiple-choice (MC) experiment. The effect is more pronounced in smaller models and when only two choices (A and B) are offered instead of the regular four (A, B, C, and D). LLaVA-NeXT-Yi-34B is highly robust to this issue.
  • Figure 4: Causal effect analysis between grounding and reasoning accuracy for the subjects in Subset A, which was the most difficult localization setting for the models.
  • Figure 5: Sample failures in grounding small objects (i.e., IoU $< 0.5$), corresponding to the Sub column results for Subset A in Table \ref{tab:auto_annot_grounding_pipeline}. The pseudo-ground-truth bounding box, which is the GroundingDINO output, is shown in green, and the output of LLaVA-NeXT-Qwen-1.5-110B, the best-performing MLLM in our grounding/localization experiment, is shown in yellow.
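
The IoU $< 0.5$ failure criterion in the Figure 5 caption can be made concrete with a minimal sketch. It assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` pixel coordinates; the box format, the function names, and the example coordinates are illustrative assumptions, not details taken from the paper.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_grounding_failure(pred: Box, reference: Box, threshold: float = 0.5) -> bool:
    """Flag a prediction whose IoU with the reference box falls below the threshold."""
    return iou(pred, reference) < threshold


# Hypothetical boxes: a pseudo-ground-truth box and a shifted model prediction.
reference_box = (0.0, 0.0, 2.0, 2.0)
predicted_box = (1.0, 1.0, 3.0, 3.0)
overlap = iou(predicted_box, reference_box)  # 1 / 7, roughly 0.143
```

For small objects, a shift of only a few pixels removes a large fraction of the overlap, so the same absolute localization error is penalized more heavily than it would be for a large object.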