3D Question Answering for City Scene Understanding
Penglei Sun, Yaoxian Song, Xiang Liu, Xiaofei Yang, Qiang Wang, Tiefeng Li, Yang Yang, Xiaowen Chu
TL;DR
This work tackles 3D multimodal question answering for city-scale scene understanding by introducing City-3DQA, the first dataset to incorporate city-level scene semantics and human-environment interaction, and a baseline method, Sg-CityU, that leverages a city-level scene graph. Sg-CityU encodes both 3D point clouds and a structured scene graph and combines them through a multimodal fusion network. On City-3DQA it achieves state-of-the-art robustness and generalization, outperforming indoor 3D MQA models and zero-shot large language models. Key contributions are the City-3DQA data construction pipeline (City-level Instance Segmentation, Scene Semantic Extraction, and 33 question templates across five categories) and the Scene graph enhanced City-level Understanding approach, which exploits spatial relationships for city-aware reasoning. The results point to potential applications for intelligent agents that must perceive, reason about, and interact with urban environments.
Abstract
3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantics and human-environment interaction tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which uses a scene graph to introduce spatial semantics. A new benchmark is reported, and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot use of advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.
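To make the idea of a city-level scene graph concrete, the following is a minimal illustrative sketch, not the Sg-CityU implementation. It assumes a graph whose nodes are labelled city instances and whose edges are spatial-relation triples, and it answers one template-style question by graph lookup. The class `SceneGraph`, the relation name `adjacent_to`, the instance ids, and the labels are all hypothetical and do not come from City-3DQA.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Maps a hypothetical instance id to its semantic label.
    nodes: dict = field(default_factory=dict)
    # Spatial relations stored as (subject id, relation, object id) triples.
    edges: list = field(default_factory=list)

    def add_instance(self, inst_id, label):
        self.nodes[inst_id] = label

    def add_relation(self, subj, relation, obj):
        self.edges.append((subj, relation, obj))

    def query(self, relation, obj_label):
        """Return sorted labels of instances related to any `obj_label` instance."""
        return sorted(
            self.nodes[s]
            for s, r, o in self.edges
            if r == relation and self.nodes[o] == obj_label
        )


# Build a toy graph with made-up instances and relations.
graph = SceneGraph()
graph.add_instance("b1", "hospital")
graph.add_instance("b2", "parking lot")
graph.add_instance("b3", "park")
graph.add_relation("b2", "adjacent_to", "b1")
graph.add_relation("b3", "adjacent_to", "b1")

# Template-style question: "What is adjacent to the <label>?"
answer = graph.query("adjacent_to", "hospital")
print(answer)
```

In this toy setting the answer is read directly from the graph. The method described in the abstract instead encodes the scene graph and fuses it with point-cloud features, which this sketch does not attempt to reproduce.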
