Multi-modal Situated Reasoning in 3D Scenes

Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, Siyuan Huang

TL;DR

The paper tackles the need for grounded, multi-modal situated reasoning in real-world 3D scenes for embodied AI. It introduces MSQA, a large-scale dataset of 251K situated QA pairs, and MSNN, a next-step navigation benchmark, both built via an automated pipeline that leverages 3D scene graphs and vision-language models to create interleaved text, image, and point-cloud inputs. Using this interleaved multi-modal input setting, the study demonstrates that current vision-language models struggle without explicit situation modeling, and introduces MSR3D as a strong baseline tailored to this setting. Scaling analyses and cross-domain transfer experiments show that MSQA can serve as an effective pre-training resource for developing more capable situated reasoning and navigation models in 3D scenes.

Abstract

Situation awareness is essential for understanding and reasoning about 3D scenes in embodied AI agents. However, existing datasets and benchmarks for situated understanding are limited in data modality, diversity, scale, and task scope. To address these limitations, we propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset collected at scale using 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes. MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark that provides text, images, and point clouds for situation and question description, resolving the ambiguity of the previous single-modality convention (e.g., text). Additionally, we devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' situated reasoning for navigation. Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling. Experiments on data scaling and cross-domain transfer further demonstrate the efficacy of leveraging MSQA as a pre-training dataset for developing more powerful situated reasoning models.

Paper Structure (57 sections, 3 equations, 28 figures, 19 tables)

Figures (28)

  • Figure 1: An overview of benchmarking tasks in MSQA. We use green boxes for objects mentioned in situation descriptions, red for objects in questions, and purple for objects in navigation instructions.
  • Figure 2: An illustration of resolving ambiguity with interleaved multi-modal input. The two chairs highlighted in purple and green boxes share the same textual description, "chair is next to the table"; providing an image that shows the target chair's location makes it easy to identify among the candidates.
  • Figure 3: An overview of our data collection pipeline, including situated scene graph generation, situated QA pairs generation, and various post-processing procedures.
  • Figure 4: Dataset statistics and quality evaluation. We visualize (a) the distribution of question types in MSQA, (b) average quality scores of MSQA, and (c) the proportion of high-scoring data compared with SQA3D.
  • Figure 5: The generation pipeline of the multi-modal situated next-step navigation (MSNN) task. We follow a generation pipeline similar to QA pairs for situated navigation action.
  • ...and 23 more figures