Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space

Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, Xiao-Ping Zhang

TL;DR

Open3D-VQA introduces a large-scale benchmark for evaluating spatial reasoning in 3D urban environments from an aerial perspective, combining real-world and simulated data with an automated QA-generation pipeline. It defines four reasoning types across seven tasks and supports RGB and point-cloud modalities, enabling comprehensive evaluation of 2D and 3D multimodal LLMs. The authors report that current models struggle with allocentric-egocentric transformations, depend on depth information for absolute distance, and benefit from sim-to-real fine-tuning, while offering a robust data-generation pipeline and evaluation toolkit. Overall, the benchmark reveals clear gaps in spatial reasoning for MLLMs and demonstrates promising sim-to-real transfer when trained on simulated data alone.

Abstract

Spatial reasoning is a fundamental capability of multimodal large language models (MLLMs), yet their performance in open aerial environments remains underexplored. In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud modalities. The questions are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs reveals that: 1) Models are generally better at answering questions about relative spatial relations than absolute distances, 2) 3D LLMs fail to demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on the simulated dataset can significantly improve the model's spatial reasoning performance in real-world scenarios. We release our benchmark, data generation pipeline, and evaluation toolkit to support further research: https://github.com/EmbodiedCity/Open3D-VQA.code.

Paper Structure

This paper contains 29 sections, 10 figures, 12 tables.

Figures (10)

  • Figure 1: The overview of Open3D-VQA. This work includes integration of real-world and simulated data collection platforms, an automatic toolchain for QA generation, and a large-scale aerial spatial reasoning benchmark.
  • Figure 2: The data curation pipeline and dataset statistics.
  • Figure 3: The average accuracy of LLaVA-1.5 and Qwen2-VL in real-world and simulated scenes.
  • Figure 4: Three common errors of MLLMs on Open3D-VQA.
  • Figure 5: The caption prompt with GPT-4o.
  • ...and 5 more figures