OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery
Xiucheng Liang, Jinheng Xie, Tianhong Zhao, Rudi Stouffs, Filip Biljecki
TL;DR
OpenFACADES tackles the scarcity of open, multi-attribute building data by combining street-view imagery with crowdsourced metadata and open-source multimodal language models. The authors develop a three-stage pipeline (isovist-based data integration, image reprojection, and dataset/model construction) that yields an open global dataset of 31,180 labeled images and 58,942 image-text pairs across seven cities. Fine-tuned open-source vision-language models (notably InternVL3-2B) achieve robust multi-attribute prediction and open-vocabulary captioning, often outperforming traditional CV baselines and zero-shot GPT-4o, with strong generalization to unseen cities and resilience to common image degradations. The work delivers a scalable framework and dataset that enable richer urban analyses, including energy modeling, material stock assessment, and risk evaluation, advancing open, multilingual, and interpretable urban analytics.
Abstract
Building properties, such as height, usage, and material, play a crucial role in spatial data infrastructures, supporting various urban applications. Despite their importance, comprehensive building attribute data remain scarce in many urban areas. Recent advances have enabled the extraction of objective building attributes using remote sensing and street-level imagery. However, establishing a pipeline that integrates diverse open datasets, acquires holistic building imagery, and infers comprehensive building attributes at scale remains a significant challenge. As one of the first efforts to bridge these gaps, this study introduces OpenFACADES, an open framework that leverages multimodal crowdsourced data to enrich building profiles with both objective attributes and semantic descriptors through multimodal large language models. First, we integrate street-level image metadata from Mapillary with OpenStreetMap geometries via isovist analysis, identifying images that provide suitable vantage points for observing target buildings. Second, we automate the detection of building facades in panoramic imagery and tailor a reprojection approach to convert the detected objects into holistic perspective views that approximate real-world observation. Third, we introduce an approach that harnesses and investigates the capabilities of open-source large vision-language models (VLMs) for multi-attribute prediction and open-vocabulary captioning in building-level analytics, leveraging a globally sourced dataset of 31,180 labeled images from seven cities. Evaluation shows that fine-tuned VLMs excel in multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o. Further experiments confirm their superior generalization and robustness across culturally distinct regions and varying image conditions.
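
To make the first stage more concrete, the following is a minimal sketch of an isovist-style visibility test, not the authors' implementation. It assumes planar coordinates in metres, footprints given as simple vertex lists, and a discretized line-of-sight check in place of a full isovist polygon. The function names (`is_visible`, `segments_cross`) and the `max_dist` and `samples` parameters are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # footprint as an ordered list of vertices (assumed planar, metres)


def _orient(a: Point, b: Point, c: Point) -> float:
    """Signed area test: > 0 if c lies to the left of the directed line a -> b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1-p2 and q1-q2 properly intersect (touching endpoints excluded)."""
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def edges(poly: Polygon) -> List[Tuple[Point, Point]]:
    """Consecutive vertex pairs, closing the ring back to the first vertex."""
    return [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]


def is_visible(camera: Point, target: Polygon, obstacles: List[Polygon],
               max_dist: float = 50.0, samples: int = 5) -> bool:
    """Approximate visibility: sample points along the target's edges and report
    True if any sample within max_dist has a sight line that no obstacle edge crosses.
    The target's own edges are not treated as blockers (hypothetical simplification)."""
    blockers = [e for obs in obstacles for e in edges(obs)]
    for a, b in edges(target):
        for k in range(1, samples + 1):
            t = k / (samples + 1)
            pt = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            if math.hypot(pt[0] - camera[0], pt[1] - camera[1]) > max_dist:
                continue  # too far away to give a useful view
            if not any(segments_cross(camera, pt, e1, e2) for e1, e2 in blockers):
                return True
    return False


# Toy scene: a camera, a target building, and a long wall standing between them.
camera = (0.0, 0.0)
target = [(10.0, -2.0), (14.0, -2.0), (14.0, 2.0), (10.0, 2.0)]
wall = [(5.0, -10.0), (6.0, -10.0), (6.0, 10.0), (5.0, 10.0)]

unobstructed = is_visible(camera, target, obstacles=[])
obstructed = is_visible(camera, target, obstacles=[wall])
```

In a real workflow the footprints would come from OpenStreetMap and the camera positions from Mapillary image metadata, both projected to a local metric coordinate system first. A geometry library would normally replace the hand-written intersection test.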
