3D Question Answering for City Scene Understanding

Penglei Sun, Yaoxian Song, Xiang Liu, Xiaofei Yang, Qiang Wang, Tiefeng Li, Yang Yang, Xiaowen Chu

TL;DR

This work tackles 3D multimodal question answering for city-scale scene understanding by introducing City-3DQA, the first dataset to incorporate city-level scene semantics and human-environment interaction, and a baseline method Sg-CityU that leverages a city-level scene graph. By encoding both 3D point clouds and a structured scene graph, and fusing them through a multimodal fusion network, Sg-CityU achieves state-of-the-art robustness and generalization, significantly outperforming indoor 3D MQA models and zero-shot large language models on City-3DQA. Key contributions include the City-3DQA data construction pipeline (including City-level Instance Segmentation, Scene Semantic Extraction, and 33 question templates across five categories), and the Scene graph enhanced City-level Understanding approach that exploits spatial relationships for accurate, city-aware reasoning. The results demonstrate strong potential for real-world intelligent agents in urban environments, enabling more capable and reliable city-scale perception, reasoning, and human-environment interaction.

Abstract

3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantics and human-environment interaction tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes the scene graph to introduce spatial semantics. A new benchmark is reported, and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot inference with advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.
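
To make the notion of spatial semantics in a scene graph concrete, the following is a minimal sketch that stores a city scene as (subject, relation, object) triples and answers a single-hop and a multi-hop query by direct lookup. It is not how Sg-CityU works (the paper encodes the scene graph and fuses it with point-cloud and question features); the instance names, relation labels, and helper functions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical scene-graph triples (subject, relation, object). The instance
# names and relation labels are illustrative, not taken from City-3DQA.
TRIPLES = [
    ("building_1", "left_of", "road_1"),
    ("park_1", "adjacent_to", "building_1"),
    ("parking_lot_1", "adjacent_to", "road_1"),
]


def build_scene_graph(triples):
    """Index triples as graph[subject][relation] -> list of objects."""
    graph = defaultdict(lambda: defaultdict(list))
    for subj, rel, obj in triples:
        graph[subj][rel].append(obj)
    return graph


def answer_relation_query(graph, subject, relation):
    """Single-hop lookup: which objects stand in `relation` to `subject`?"""
    return sorted(graph.get(subject, {}).get(relation, []))


def answer_two_hop_query(graph, subject, rel1, rel2):
    """Multi-hop lookup: follow rel1 from subject, then rel2 from each result."""
    results = set()
    for mid in answer_relation_query(graph, subject, rel1):
        results.update(answer_relation_query(graph, mid, rel2))
    return sorted(results)


graph = build_scene_graph(TRIPLES)
# "What is adjacent to park_1?" (single-hop)
single_hop = answer_relation_query(graph, "park_1", "adjacent_to")
# "What is the thing adjacent to park_1 to the left of?" (multi-hop)
multi_hop = answer_two_hop_query(graph, "park_1", "adjacent_to", "left_of")
print(single_hop, multi_hop)
```

The multi-hop query chains two relations through an intermediate instance, which is the kind of spatial reasoning that point clouds alone do not expose explicitly.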

Paper Structure (24 sections, 9 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Comparison of City-3DQA with other 3D multimodal question answering (MQA) tasks. Existing research in 3D MQA focuses on indoor household scenes (a) and outdoor autonomous driving scenes (b). However, these studies lack spatial semantic and city-level interaction information within the city. City-3DQA (c) is the first dataset to focus on 3D MQA for outdoor city scene understanding.
  • Figure 2: Data Construction Pipeline for City-3DQA. The pipeline consists of three main stages: City-level Instance Segmentation, Scene Semantic Extraction, and Question-Answer Pair Construction.
  • Figure 3: Statistical distributions of questions in the City-3DQA dataset. Question length is the number of words in the question sentence. Multi and Single denote multi-hop and single-hop questions, respectively.
  • Figure 4: The framework of our proposed model Sg-CityU (a) and the Fusion Layer in Sg-CityU (b). In Sg-CityU, the question, scene graph, and point clouds are processed by the feature extraction backbone to obtain multimodal features. The multimodal features are then fed into the Fusion Layer and Answer Layer for answer generation. In the Fusion Layer, we stack multimodal fusion network (MMFN) layers based on self-attention and cross-attention to fuse the inputs from different modalities.
  • Figure 5: Visualization of examples. We compare and visualize the answers generated by Qwen-VL, Llama-2, and Sg-CityU. We visualize the city scene with the instance labels and scene graph (sg). ✓ and ✘ denote correct and wrong answers, respectively.
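
The Figure 4 caption describes a Fusion Layer built from self-attention and cross-attention. The sketch below shows one way such a block could be wired using plain scaled dot-product attention. It is a simplified illustration under stated assumptions, not the paper's MMFN: there are no learned projections, and the choice of question features as queries, scene features as keys and values, and the residual sum are all assumptions, as are the names `attention` and `fusion_block`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output component at a time.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def fusion_block(text_feats, scene_feats):
    """Assumed wiring: self-attention on text, then cross-attention to scene."""
    attended = attention(text_feats, text_feats, text_feats)
    crossed = attention(attended, scene_feats, scene_feats)
    # Residual sum of the two streams (an assumption, not from the paper).
    return [
        [a + c for a, c in zip(row_a, row_c)]
        for row_a, row_c in zip(attended, crossed)
    ]


# Toy features: 2 question tokens and 3 scene elements, each of width 2.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
scene_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fusion_block(text_feats, scene_feats)
print(len(fused), len(fused[0]))
```

The output keeps one fused vector per question token, so several such blocks can be stacked before a final answer layer.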