OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery

Xiucheng Liang, Jinheng Xie, Tianhong Zhao, Rudi Stouffs, Filip Biljecki

TL;DR

OpenFACADES tackles the scarcity of open, multi-attribute building data by combining street-view imagery with crowdsourced metadata and open-source multimodal language models. The authors develop a three-stage pipeline (isovist-based data integration, image reprojection, and dataset/model construction), yielding an open global dataset of 31,180 labeled images and 58,942 image-text pairs across seven cities. Fine-tuned open-source vision-language models (notably InternVL3-2B) achieve robust multi-attribute prediction and open-vocabulary captioning, often outperforming traditional CV baselines and zero-shot GPT-4o baselines, with strong generalization to unseen cities and resilience to common image degradations. The work delivers a scalable framework and dataset that enable richer urban analyses, including energy modeling, material stock assessment, and risk evaluation, advancing open, multilingual, and interpretable urban analytics.

Abstract

Building properties, such as height, usage, and material, play a crucial role in spatial data infrastructures, supporting various urban applications. Despite their importance, comprehensive building attribute data remain scarce in many urban areas. Recent advances have enabled the extraction of objective building attributes using remote sensing and street-level imagery. However, establishing a pipeline that integrates diverse open datasets, acquires holistic building imagery, and infers comprehensive building attributes at scale remains a significant challenge. As one of the first studies to address this, this work bridges these gaps by introducing OpenFACADES, an open framework that leverages multimodal crowdsourced data to enrich building profiles with both objective attributes and semantic descriptors through multimodal large language models. First, we integrate street-level image metadata from Mapillary with OpenStreetMap geometries via isovist analysis, identifying images that provide suitable vantage points for observing target buildings. Second, we automate the detection of building facades in panoramic imagery and tailor a reprojection approach to convert objects into holistic perspective views that approximate real-world observation. Third, we introduce an innovative approach that harnesses and investigates the capabilities of open-source large vision-language models (VLMs) for multi-attribute prediction and open-vocabulary captioning in building-level analytics, leveraging a globally sourced dataset of 31,180 labeled images from seven cities. Evaluation shows that fine-tuned VLMs excel in multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o. Further experiments confirm their superior generalization and robustness across culturally distinct regions and varying image conditions.
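
To make the first step more concrete, the idea behind an isovist check can be sketched with simple 2D ray casting: rays are cast from the camera position, and a building counts as observable if it is the first footprint hit by at least one ray. This is only an illustrative approximation, not the authors' implementation. The function names (`ray_hit_distance`, `visible_buildings`), the planar metre coordinates, and the `max_dist` and `n_rays` defaults are all assumptions made for this example.

```python
import math


def ray_hit_distance(origin, angle, segment):
    """Distance along a ray (from origin, at angle in radians) to a segment.

    Returns None when the ray misses the segment or runs parallel to it.
    """
    ox, oy = origin
    dx, dy = math.cos(angle), math.sin(angle)
    (x1, y1), (x2, y2) = segment
    ex, ey = x2 - x1, y2 - y1
    denom = dx * ey - dy * ex  # 2D cross product of ray and edge directions
    if abs(denom) < 1e-12:
        return None
    # Solve origin + t * d == p1 + u * e for t (along ray) and u (along edge).
    t = ((x1 - ox) * ey - (y1 - oy) * ex) / denom
    u = ((x1 - ox) * dy - (y1 - oy) * dx) / denom
    if t > 0.0 and 0.0 <= u <= 1.0:
        return t
    return None


def visible_buildings(camera, footprints, max_dist=50.0, n_rays=360):
    """Count, per building, the rays whose first hit is that building.

    footprints maps a building id to a list of (x, y) vertices in a planar
    coordinate system (hypothetical input format). Buildings that are
    occluded or farther away than max_dist do not appear in the result.
    """
    edges = []
    for building_id, ring in footprints.items():
        for i in range(len(ring)):
            edges.append((building_id, (ring[i], ring[(i + 1) % len(ring)])))

    counts = {}
    for k in range(n_rays):
        angle = 2.0 * math.pi * k / n_rays
        nearest_id, nearest_t = None, max_dist
        for building_id, segment in edges:
            t = ray_hit_distance(camera, angle, segment)
            if t is not None and t < nearest_t:
                nearest_id, nearest_t = building_id, t
        if nearest_id is not None:
            counts[nearest_id] = counts.get(nearest_id, 0) + 1
    return counts
```

Calling `visible_buildings((0.0, 0.0), footprints)` on a dictionary of footprints returns only the buildings with an unobstructed line of sight from the camera, and the ray count gives a rough measure of how much of the view each one occupies. In practice a geometry library and projected coordinates would replace the hand-written intersection test.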

Paper Structure

This paper contains 48 sections, 10 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: General framework for integrating multimodal crowdsourced data to establish a street-level building dataset and develop multimodal models. Data: (c) Mapillary and OpenStreetMap contributors.
  • Figure 2: Workflow for obtaining and integrating suitable multimodal crowdsourced data, combining street-level imagery from Mapillary and building information from OpenStreetMap, along with external sources such as Overture Maps and government data, to harmonize the building dataset. Data: (c) Mapillary and OpenStreetMap contributors.
  • Figure 3: Pipeline demonstrating the extraction and selection of building images from street-level imagery, involving object detection, pixel coordinate transformation and reprojection, and feature-based filtering. Data: (c) Mapillary contributors.
  • Figure 4: Examples of different types of building images used as input to the vision-language model, resulting in varied responses. By generating a holistic view of individual buildings, our method facilitates a more authentic analysis and interpretation. Data: (c) Mapillary contributors.
  • Figure 5: Different label types and data collection approaches for developing a street-level building dataset. Data: (c) OSM and Mapillary contributors.
  • ...and 10 more figures