Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering

Zechuan Li, Hongshan Yu, Yihao Ding, Yan Li, Yong He, Naveed Akhtar

TL;DR

This survey reviews 3D Scene Question Answering (3D SQA), a field that merges 3D visual perception with natural language understanding to enable embodied reasoning in 3D environments. It inventories datasets, methods, and evaluation metrics, tracing a shift from manually curated datasets to LVLM-assisted generation and from task-specific pipelines to instruction-tuned and zero-shot approaches. The authors identify core challenges in dataset quality, multimodal alignment, and standardized evaluation, and propose directions spanning dataset construction, task generalization, interaction modeling, and unified benchmarks. Overall, the work provides a foundation for building more generalizable, spatially grounded 3D SQA systems capable of supporting real-world embodied AI tasks.

Abstract

3D Scene Question Answering (3D SQA) is an interdisciplinary task that integrates 3D visual perception and natural language processing, empowering intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal modeling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress introduces challenges, particularly in achieving unified analysis and comparison across datasets and baselines. In this survey, we provide the first comprehensive and systematic review of 3D SQA. We organize existing work from three perspectives: datasets, methodologies, and evaluation metrics. Beyond basic categorization, we identify shared architectural patterns across methods. Our survey further synthesizes core limitations and discusses how current trends, such as instruction tuning, multimodal alignment, and zero-shot inference, can shape future developments. Finally, we propose a range of promising research directions covering dataset construction, task generalization, interaction modeling, and unified evaluation protocols. This work aims to serve as a foundation for future research and foster progress toward more generalizable and intelligent 3D SQA systems.

Paper Structure

This paper contains 19 sections, 10 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: 2D Scene VQA and 3D SQA tasks. 3D SQA handles non-embodied as well as embodied tasks involving agent interactions within 3D scenes.
  • Figure 2: Graphical illustration of the hierarchical structure of 3D SQA literature adopted in this work, which systematically categorizes preliminaries, datasets, evaluation metrics, and methodologies.
  • Figure 3: Dataset generation workflow.
  • Figure 4: Overview of a generalized 3D SQA pipeline. The scene input—represented as images or point clouds—is processed by a visual encoder, while the question input—comprising textual and potentially egocentric visual components—is encoded separately. The resulting features are fused via a dedicated fusion module and passed to a joint prediction head that outputs the answer and optional 3D bounding boxes. Recent approaches enhance this pipeline by incorporating large vision-language models (LVLMs) to support instruction tuning and zero-shot reasoning.
  • Figure 5: Typical architecture of task-specific 3D SQA methods. Scene and query (question) features are encoded separately, fused via a transformer-based module, and used to predict the answer, optionally with bounding boxes and object categories.
  • ...and 2 more figures
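
The generalized pipeline described in the Figure 4 caption (separate scene and question encoders, a fusion module, and a joint head that outputs an answer and an optional 3D box) can be sketched roughly as below. This is a toy illustration under loud assumptions, not any surveyed method's implementation: every function name, the centroid "encoder", the character-bucket text "encoder", concatenation as fusion, and the linear scoring head are hypothetical stand-ins chosen only to make the data flow visible.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Vector = List[float]


def encode_scene(points: Sequence[Point]) -> Vector:
    """Stand-in visual encoder (assumption): the point-cloud centroid."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


def encode_question(question: str, dim: int = 3) -> Vector:
    """Stand-in text encoder (assumption): letters bucketed into `dim` bins."""
    vec = [0.0] * dim
    for ch in question.lower():
        if ch.isalpha():
            vec[(ord(ch) - ord("a")) % dim] += 1.0
    total = sum(vec) or 1.0  # avoid dividing by zero for letter-free input
    return [v / total for v in vec]


def fuse(scene_feat: Vector, question_feat: Vector) -> Vector:
    """Stand-in fusion module (assumption): plain concatenation."""
    return list(scene_feat) + list(question_feat)


def scene_box(points: Sequence[Point]) -> Tuple[float, ...]:
    """Placeholder box output: axis-aligned box (center + size) of all points."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]
    size = [hi - lo for lo, hi in zip(mins, maxs)]
    return tuple(center + size)


def answer_question(
    points: Sequence[Point],
    question: str,
    answers: Sequence[str],
    weights: Sequence[Vector],
    return_box: bool = False,
) -> Tuple[str, Optional[Tuple[float, ...]]]:
    """Joint head (assumption): linear score per candidate answer, plus optional box."""
    fused = fuse(encode_scene(points), encode_question(question))
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    best = max(range(len(answers)), key=scores.__getitem__)
    return answers[best], (scene_box(points) if return_box else None)
```

In a real system each stand-in would be replaced by a learned component (for example, a point-cloud backbone, a language model, and a transformer-based fusion module); the sketch only fixes the interfaces between the stages.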