3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu, Hongchao Fan, Hao Wu

Abstract

While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for the target object, inter-object relationships, and the global scene. To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluation of all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.

Paper Structure (34 sections, 4 equations, 11 figures, 12 tables)

Figures (11)

  • Figure 1: The statistical information and representative cases of the 3DCity-LLM-1.2M dataset. We provide two representative cases for each task type. The specific 3D numerical information is highlighted with underscores
  • Figure 2: Automated data generation pipeline of the 3DCity-LLM-1.2M dataset, leveraging instance-level masks and landmark annotations from the SensatUrban, UrbanBIS, and City-BIS datasets. We first construct city scene attributes for each environment as input to the VLM, then prompt the VLM with multiple instructions to generate diverse and high-quality QA pairs grounded in the scene. In total, we built 1.2M samples spanning the object caption, object localization, object analysis, relationship computation, scene caption, scene analysis, and scene planning tasks
  • Figure 3: Model architecture of 3DCity-LLM. 3DCity-LLM receives the target object, its neighboring objects, the city scene, and a text query as multi-modality inputs, then identifies the task type from the text query and activates the corresponding feature encoding branches before producing the final answer
  • Figure 4: The coarse-to-fine feature encoding in 3DCity-LLM. (a) Object Encoding, (b) Relationship Encoding, (c) Scene Encoding
  • Figure 5: Qualitative results on the object-level tasks from the 3DCity-LLM-1.2M dataset. (a) Object Caption, (b) Object Analysis, (c) Object Localization
  • ...and 6 more figures
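
The Figure 3 caption describes a routing step: the task type is identified from the text query, and only the corresponding encoding branches are activated. The following is a minimal sketch of that idea, not the authors' implementation. The seven task names come from the Figure 2 caption and the three branches from the abstract, but the task-to-branch mapping, the `Branch` enum, and the `active_branches` helper are all hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Branch(Enum):
    """The three encoding branches named in the abstract."""
    OBJECT = "object"
    RELATIONSHIP = "relationship"
    SCENE = "scene"


# ASSUMPTION: this mapping is illustrative only. The source states that the
# task type selects the branches, but does not say which task uses which.
TASK_BRANCHES: Dict[str, FrozenSet[Branch]] = {
    "object caption": frozenset({Branch.OBJECT}),
    "object localization": frozenset({Branch.OBJECT, Branch.SCENE}),
    "object analysis": frozenset({Branch.OBJECT, Branch.RELATIONSHIP}),
    "relationship computation": frozenset({Branch.OBJECT, Branch.RELATIONSHIP}),
    "scene caption": frozenset({Branch.SCENE}),
    "scene analysis": frozenset({Branch.SCENE}),
    "scene planning": frozenset({Branch.SCENE}),
}


def active_branches(task_type: str) -> FrozenSet[Branch]:
    """Return the set of branches to run for a given task type (hypothetical helper)."""
    key = task_type.strip().lower()
    if key not in TASK_BRANCHES:
        raise ValueError(f"unknown task type: {task_type!r}")
    return TASK_BRANCHES[key]
```

A caller would classify the query into one of the seven task types first, then encode only the inputs whose branch appears in the returned set, so that purely scene-level questions skip the object and relationship encoders.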