UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction

Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, Yuxuan Liang

TL;DR

UrbanVLP introduces a multi-granularity vision-language pretraining framework that fuses macro satellite imagery with micro street-view cues and calibrates automatically generated text to predict urban socioeconomic indicators. It employs a dual-branch cross-modal alignment with global and token-level contrastive losses, coupled with text generation and a PerceptionScore-based calibration to reduce hallucination and homogenization. The approach is validated on the CityView benchmark, showing consistent improvements over strong baselines and robust transferability across cities, while delivering a practical web-based system for urban planning insights. This work demonstrates the value of integrating multi-scale visual data and high-quality text in urban analytics, setting a path for richer, more interpretable urban representations.
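
As a rough illustration of the global contrastive objective mentioned above, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss over paired image and text embeddings. It is not the authors' implementation: the function names, the temperature value of 0.07, and the pure-Python list representation are all assumptions made for illustration, and the token-level loss and PerceptionScore calibration are not covered.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def global_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_embs[i] and text_embs[i] are assumed to describe the same region.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(im, tx)) / temperature for tx in txts]
        for im in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))
```

In this sketch, a batch whose image and text embeddings are correctly paired yields a loss near zero, while a batch with swapped pairings yields a large loss, which is the signal that pulls matching image-text pairs together during pretraining.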

Abstract

Urban socioeconomic indicator prediction aims to infer metrics related to sustainable development across diverse urban landscapes using data-driven methods. However, prevalent pretrained models, particularly those reliant on satellite imagery, face two challenges. First, concentrating solely on macro-level patterns from satellite data may introduce bias, because such data lacks micro-level detail such as the architecture at a given location. Second, the text generated by the precursor work UrbanCLIP, which relies on the extensive knowledge of LLMs, frequently exhibits hallucination and homogenization, making its quality unreliable. In response, we devise UrbanVLP, a novel framework based on Vision-Language Pretraining. UrbanVLP integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration to help ensure high-quality text descriptions of urban imagery. Experiments across six socioeconomic indicator prediction tasks demonstrate its superior performance.

Paper Structure

This paper contains 37 sections, 9 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: USI (urban socioeconomic indicator) prediction frameworks. Compared with prior work, ours is the first attempt to introduce multi-granular visual information and high-quality calibrated texts.
  • Figure 2: $R^2$ results in Beijing and Shenzhen.
  • Figure 3: Overall framework of our proposed UrbanVLP.
  • Figure 4: The procedure of CycleScore calculation.
  • Figure 5: Ablation study on CityView-Beijing.
  • ...and 14 more figures