OpenCity3D: What do Vision-Language Models know about Urban Environments?

Valentin Bieri, Marco Zamboni, Nicolas S. Blumer, Qingxuan Chen, Francis Engelmann

TL;DR

OpenCity3D demonstrates that vision-language models can be leveraged for city-scale 3D scene understanding to infer socio-economic attributes from aerial reconstructions. By constructing a language-enriched 3D city representation that fuses multi-scale SAM-based segmentation with per-vertex VLM features, and by applying similarity, supervised, or GPT-4o-based predictions, the approach achieves strong zero-shot and few-shot performance for building footprint, building age, and housing price estimation across multiple cities, while crime and noise predictions remain more challenging. The work provides a foundational benchmark and methodology for language-driven urban analytics, with practical implications for planning, policy, and environmental monitoring, alongside clear limitations related to data availability, bias, and scalability. Overall, OpenCity3D offers a scalable framework to translate language-grounded urban knowledge into actionable 3D analytics, highlighting both the potential and the need for broader datasets and debiasing strategies.
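
The similarity-based prediction mentioned above can be illustrated with a minimal sketch. It assumes per-vertex features and a text-prompt embedding that live in the same embedding space, scores each vertex by cosine similarity, and rescales the scores to [0, 1] so they can be shown as a heatmap. The function name `relevance_heatmap`, the 512-dimensional features, and the random stand-in data are all hypothetical; the paper's actual encoder and scoring details are not given in this summary.

```python
import numpy as np

def relevance_heatmap(vertex_feats, text_feat, eps=1e-8):
    """Score each vertex against a text embedding (hypothetical helper).

    vertex_feats: (N, D) array, one feature vector per mesh vertex.
    text_feat:    (D,) array, embedding of the text prompt.
    Returns an (N,) array of cosine similarities rescaled to [0, 1].
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    v = vertex_feats / (np.linalg.norm(vertex_feats, axis=1, keepdims=True) + eps)
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    sim = v @ t
    # Min-max rescale so the most relevant vertex maps close to 1.
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + eps)

# Random stand-ins for real features (assumption: 1000 vertices, D = 512).
rng = np.random.default_rng(0)
vertex_feats = rng.normal(size=(1000, 512))
text_feat = rng.normal(size=512)
heat = relevance_heatmap(vertex_feats, text_feat)
```

In a real pipeline the random arrays would be replaced by features from a vision-language model, and `heat` would be mapped to vertex colors on the mesh.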

Abstract

Vision-language models (VLMs) show great promise for 3D scene understanding but are mainly applied to indoor spaces or autonomous driving, focusing on low-level tasks like segmentation. This work expands their use to urban-scale environments by leveraging 3D reconstructions from multi-view aerial imagery. We propose OpenCity3D, an approach that addresses high-level tasks, such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation. Our findings highlight OpenCity3D's impressive zero-shot and few-shot capabilities, showcasing adaptability to new contexts. This research establishes a new paradigm for language-driven urban analytics, enabling applications in planning, policy, and environmental monitoring. See our project page: opencity3d.github.io

Paper Structure

This paper contains 37 sections, 2 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: OpenCity3D is a method for zero-shot urban 3D scene understanding, enabling insights into higher-level attributes such as crime rates, population density, housing prices, and local landmarks. For each text prompt, we visualize a response heatmap, where areas of higher relevance are highlighted in yellow, transitioning to blue for lower relevance.
  • Figure 2: The OpenCity3D model. Multi-view RGB-D images are rendered from aerial 3D reconstructions, followed by extracting pixel-wise hierarchical visual-language features. These features are mapped back to the 3D mesh, enabling language-based queries.
  • Figure 3: Visualization of SAM's multi-scale segments across three hierarchy levels: small, medium, and large.
  • Figure 4: Example of a highlighted segment. Not removing the background provides context.
  • Figure 5: Illustration of the challenge to estimate building ages in the Rotterdam mesh: the left building dates from 1907, while the right one (directly adjacent) was built in 1997.
  • ...and 12 more figures