Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering
Zechuan Li, Hongshan Yu, Yihao Ding, Yan Li, Yong He, Naveed Akhtar
TL;DR
This survey reviews 3D Scene Question Answering (3D SQA), a field that merges 3D visual perception with natural language understanding to enable embodied reasoning in 3D environments. It inventories datasets, methods, and evaluation metrics, tracing a shift from manually curated datasets to generation assisted by large vision-language models (LVLMs), and from task-specific pipelines to instruction-tuned, zero-shot approaches. The authors identify core challenges in dataset quality, multimodal alignment, and standardized evaluation, and propose directions spanning dataset construction, task generalization, interaction modeling, and unified benchmarks. Overall, the work provides a foundation for building more generalizable, spatially grounded 3D SQA systems capable of supporting real-world embodied AI tasks.
Abstract
3D Scene Question Answering (3D SQA) is an interdisciplinary task that integrates 3D visual perception and natural language processing, enabling intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal modelling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress introduces challenges, particularly in achieving unified analysis and comparison across datasets and baselines. In this survey, we provide the first comprehensive and systematic review of 3D SQA. We organize existing work from three perspectives: datasets, methodologies, and evaluation metrics. Beyond basic categorization, we identify shared architectural patterns across methods. Our survey further synthesizes core limitations and discusses how current trends, such as instruction tuning, multimodal alignment, and zero-shot inference, can shape future developments. Finally, we propose a range of promising research directions covering dataset construction, task generalization, interaction modeling, and unified evaluation protocols. This work aims to serve as a foundation for future research and to foster progress toward more generalizable and intelligent 3D SQA systems.
