OpenCity3D: What do Vision-Language Models know about Urban Environments?

Valentin Bieri; Marco Zamboni; Nicolas S. Blumer; Qingxuan Chen; Francis Engelmann

TL;DR

OpenCity3D shows that vision-language models (VLMs) can be used for city-scale 3D scene understanding, inferring socio-economic attributes from aerial reconstructions. The approach constructs a language-enriched 3D city representation that fuses multi-scale SAM-based segmentation with per-vertex VLM features, then makes predictions via similarity, supervised learning, or GPT-4o. It achieves strong zero-shot and few-shot performance for building footprint, building age, and housing price estimation across multiple cities, while crime and noise predictions remain more challenging. The work provides a foundational benchmark and methodology for language-driven urban analytics, with practical implications for planning, policy, and environmental monitoring, alongside clear limitations related to data availability, bias, and scalability. Overall, it highlights both the potential of translating language-grounded urban knowledge into 3D analytics and the need for broader datasets and debiasing strategies.
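
To make the similarity-based prediction mentioned above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that per-vertex features and a text-query embedding live in the same vector space, that relevance is measured by cosine similarity, and that per-vertex scores are averaged within each building. The function names, the `building_ids` grouping, and the toy 2D vectors (standing in for real VLM embeddings) are all hypothetical.

```python
import numpy as np


def cosine_scores(vertex_feats: np.ndarray, query_feat: np.ndarray) -> np.ndarray:
    """Cosine similarity between each per-vertex feature and one query feature."""
    v = vertex_feats / np.linalg.norm(vertex_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    return v @ q


def building_scores(vertex_feats: np.ndarray,
                    building_ids: np.ndarray,
                    query_feat: np.ndarray) -> dict:
    """Average the per-vertex similarities within each building.

    Mean aggregation is an assumption made for illustration only.
    """
    scores = cosine_scores(vertex_feats, query_feat)
    return {
        int(b): float(scores[building_ids == b].mean())
        for b in np.unique(building_ids)
    }


# Toy stand-ins: four vertices with 2D "features", two buildings, one query.
vertex_feats = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.0, 2.0]])
building_ids = np.array([0, 0, 1, 1])
query_feat = np.array([1.0, 0.0])  # hypothetical embedding of a text prompt

result = building_scores(vertex_feats, building_ids, query_feat)
```

In this toy setup, building 0 scores higher than building 1 because its vertices point in roughly the same direction as the query. Ranking buildings by such a score is one plausible way to read a zero-shot, similarity-based prediction.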

Abstract

Vision-language models (VLMs) show great promise for 3D scene understanding but are mainly applied to indoor spaces or autonomous driving, focusing on low-level tasks like segmentation. This work expands their use to urban-scale environments by leveraging 3D reconstructions from multi-view aerial imagery. We propose OpenCity3D, an approach that addresses high-level tasks, such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation. Our findings highlight OpenCity3D's impressive zero-shot and few-shot capabilities, showcasing adaptability to new contexts. This research establishes a new paradigm for language-driven urban analytics, enabling applications in planning, policy, and environmental monitoring. See our project page: opencity3d.github.io
